In this Issue

The HP Precision Architecture development program, known within HP
as the Spectrum program, is the largest system development program ever
undertaken by the Hewlett-Packard Company. The program developed not
only a new system architecture, but also all hardware and software
components necessary to constitute an entirely new computer system family.
It encompassed architecture, VLSI technology, the MPE XL commercial
operating system, the HP-UX real-time standard UNIX operating system, a new
family of optimizing compilers, a new data base facility, and integration
with the HP AdvanceNet networking strategy.
The papers published in the August 1985 and January 1986 issues of the HP Journal outline
the reasons for the development of HP Precision Architecture and describe the structure of the
next generation compiler family. In this issue of the HP Journal, we are happy to be able to
present the first of a planned set of papers that explain key program elements in greater levels
of detail. We intend these papers to be tutorial in nature, describing and explaining program
elements and presenting the basic research and measurement results that were achieved.

In this issue we begin with papers covering an overview of the processor architecture (page 
4), a summary of the I/O architecture (page 23), a description of the performance analysis activities
used throughout the program (page 30), and a description of the simulator tools that grew into
our general software diagnostic tools (page 40). In subsequent issues, we plan to present papers 
describing hardware components, software system components, software engineering practices, 
and performance results. We expect that the collected set of papers will then constitute a good 
technical overview of the Spectrum program and the key research results that emerged from it. 

William S. Worley, Jr.
Guest Editor

The cover photograph shows a "block diagram" representing the HP Precision Architecture
execution engine, which is shown more conventionally in Fig. 3 on page 7.



What's Ahead 

Next month's issue will have a series of articles on the design of the HP 9000 Series 300
modular engineering workstations, and a part historical, part tutorial treatise on implementing a
worldwide electronic mail system, based on HP's experience with its own HP DeskManager
product. 



The HP Journal encourages technical discussion of the topics presented in recent articles and will publish letters expected to be of interest to our readers. Letters must be brief and are subject
to editing. Letters should be addressed to: Editor, Hewlett-Packard Journal, 3000 Hanover Street, Palo Alto, CA 94304, U.S.A.



AUGUST 1986 HEWLETT-PACKARD JOURNAL 3
© Copr. 1949-1998 Hewlett-Packard Co.



Hewlett-Packard Precision Architecture: 
The Processor 

This article describes the architecture's basic organization, 
execution model, control flow model, addressing and
protection model, functional operations, and instruction 
formats and encoding. 

by Michael J. Mahon, Ruby Bei-Loh Lee, Terrence C. Miller, Jerome C. Huck, and William R. Bryg



"Everything should be made as simple as possible, but not
simpler." Albert Einstein

THE HP PRECISION ARCHITECTURE development
program had the objective of designing a computer
architecture capable enough and versatile enough to
excel in all of Hewlett-Packard's computer markets:
commercial, engineering and scientific, and manufacturing.
Such an architecture would have to scale easily across a
broad performance range, provide for straightforward
migration of applications from existing systems, and serve as
the architectural foundation for at least the next decade of
product development.

To address this problem, an unusual group of people
was brought together, from within and outside
Hewlett-Packard, possessing unusually diverse experience and
training. Under the leadership of Bill Worley, this small
group of compiler designers, operating system designers,
performance analysts, hardware designers, microcoders,
and system architects was forged into a team. The intent
was to bring together many different perspectives, so that
the team could deal effectively with design trade-offs that
cross the traditional boundaries between disciplines.

The design methodology was as unusual as the team. It
was an iterative, closed-loop, measurement-oriented
approach to computer architecture. The process began with
data collection and analysis of what computers,
Hewlett-Packard's and others', were actually doing during
application execution. Early results validated the suggestions of
some RISC architecture researchers that simpler designs
were a better match to the actual behavior of machines,
and could substantially improve cost/performance. The
scalability and generality requirements provided further
incentives to reduce system complexity.

After a simple "core" architecture was postulated, the
team examined it intensively through simulation and
measurement. We evaluated its suitability as a target for
compilation and optimization, and as a host for modern
operating systems. Logic designs were done simultaneously in
several circuit and packaging technologies to evaluate the
implications of the architectural decisions on hardware
realizations.

After a round of evaluation, the results became the basis
for a series of proposed refinements to the architecture.
After critical study, the best proposals were incorporated
into the architecture, the simulator was updated, and the
evaluation process began again.

This process continued for four major (and many minor)
iterations over a period of 18 months. At each successive 
iteration, the architecture and all proposed changes were 
published internally for review by key technical people in 
product divisions. As the project progressed, an increasing 
proportion of the proposals and evaluations came from divi- 
sional participants. 

The iterative design sometimes resulted in adding a
function. For example, the frequent requirement to shift index
registers to index to half words, words, or double words
in a byte-addressed machine led to the addition of a
zero-to-three-bit preshifter to scale one of the inputs to the adder.

More frequently, iteration resulted in deleting mechanisms 
revealed as too onerous or too little used. An example is 
the deletion of the STORE INDEXED instruction, because it 
was the only instruction that would have required a register 
file capable of reading three registers simultaneously. Com- 
piler strategies were found that all but eliminated the need 
for the STORE INDEXED instruction, which in any case could 
be simulated in two instructions. Another example was 
the deletion of a rather irregular MULTIPLY STEP instruction, 
when it was discovered that virtually all integer
multiplications could be performed efficiently using SHIFT AND ADD
instructions, which were a natural byproduct of the index 
preshifter described above. 
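
As a rough illustration of why the index preshifter makes a separate
multiply instruction largely unnecessary, the following Python sketch
multiplies by small constants using only steps of the form
(x << k) + y with k limited to 1, 2, or 3. The function names and the
particular decompositions are illustrative assumptions, not the
architecture's instruction definitions.

```python
MASK32 = 0xFFFFFFFF  # results are truncated to the 32-bit data path


def shift_and_add(x, k, y):
    """Return ((x << k) + y) modulo 2**32, with k restricted to 1..3."""
    if k not in (1, 2, 3):
        raise ValueError("preshift amount must be 1, 2, or 3")
    return ((x << k) + y) & MASK32


def times_10(x):
    """Multiply by 10 in two shift-and-add steps."""
    t = shift_and_add(x, 2, x)     # t = 4x + x = 5x
    return shift_and_add(t, 1, 0)  # 2t + 0 = 10x


def times_45(x):
    """Multiply by 45 in two shift-and-add steps."""
    t = shift_and_add(x, 3, x)     # t = 8x + x = 9x
    return shift_and_add(t, 2, t)  # 4t + t = 45x
```

A compiler that knows the multiplier at compile time can choose such a
sequence once; the cost is a handful of single-cycle operations rather
than a multi-cycle multiply.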

The result of this process is an architecture honed by 
data, tested against various implementation technologies, 
and broadly tuned to a wide variety of system and applica- 
tion tasks. 

Overview 

An HP Precision processor is one element of a complete
system. The system also includes memory arrays, I/O
devices, attached processors, and interconnection structures
such as buses and networks. Fig. 1 shows a typical system.
The processor interfaces to a central bus like any other
module and uses the bus to reference main memory and
I/O devices. External interrupts are also transmitted over
the buses.

Processor Overview

The processor module is organized as instruction fetch
and execute units with a tightly coupled high-speed cache
system. While a cache is optional, it is such a cost-effective
component that nearly all processors will incorporate this
hardware. The processor module may also include a
hardware address translation table called a translation
lookaside buffer or TLB, and assist hardware for extra
functions such as floating-point operations. The main data paths
are 32 bits wide, and the memory system is byte addressed.

The execution unit performs data transformations on 
local registers and generates addresses to reference the 
cache and main memory. It has a memory system interface 
for moving data operands between the memory system and 
the registers. The execution unit may be supplemented by 
assist hardware — coprocessors or special function units — 
to augment its capabilities for application-specific opera- 
tions or data types. This is discussed further in the sections 
on the execution model. 

The fetch unit calculates the instruction address, fetches
the instruction, decodes it, and sends information to the
execution unit. The fetch unit greatly benefits from a
reduced-complexity instruction set. Instructions are all
fixed-width 32-bit objects, simplifying decoding and calculation
of the next instruction address. The fetch unit is responsible
for the control flow in the processing of instructions. This
is discussed further in later sections on the control model.

HP Precision Architecture uses a memory hierarchy as
a cost-effective means of achieving nearly the speed of the
fastest (highest) memory level, with the capacity of the
largest (lowest) memory level. The highest level of the
hierarchy is the registers, followed by the caches. Main
memory is the next level and the I/O system provides the
largest and slowest level of storage. In HP Precision
Architecture, the cache system is architecturally visible in
the sense that there are cache control instructions for cache
management. A virtual memory system is a characteristic
feature on all but the smallest HP Precision processors.
Virtual address protection and translation provide security
and a large, flat, global address space for all processes. This
is discussed further in the sections on the addressing and
protection model.



Provisions are made for attached processors, which
interface to the system hierarchy at the memory bus level, and
typically have their own registers and local cache system.
Attached processors can provide such functions as I/O or
vector processing. Clustered and tightly coupled
multiprocessing are also supported for modular expansion of the
system.

Processing Resources 

The processing resources are organized around three
register arrays and a few specialized registers (see Fig. 2). The
general register array contains general-purpose registers
used for all computations. The space register array is used
to build virtual addresses. The control register array is a
collection of registers used for virtual address protection,
interruption processing, and other miscellaneous functions.

The general register array contains thirty-two 32-bit
general-purpose registers. Register zero is special: it always
returns zero when read and discards any result when used
as a target register. This specialization is easily implemented
in hardware and eliminates the need for instructions for
unary or condition-testing operations. For example, a copy
operation is a logical OR with register zero and unary
SUBTRACT also uses register zero as a source. Registers 1 and
31 are also specialized as implied targets for a few
instructions that have no space in the instruction for target register
specifiers.
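
The effect of the register-zero convention can be sketched with a toy
register file in Python. The class and function names below are
hypothetical and exist only for illustration; the point is that copy
and negate need no opcodes of their own once register zero reads as
zero and ignores writes.

```python
MASK32 = 0xFFFFFFFF


class GeneralRegisters:
    """Toy model of thirty-two 32-bit registers with a hardwired register zero."""

    def __init__(self):
        self._r = [0] * 32

    def read(self, n):
        # Register zero always reads as zero.
        return 0 if n == 0 else self._r[n]

    def write(self, n, value):
        # Writes to register zero are discarded.
        if n != 0:
            self._r[n] = value & MASK32


def op_or(regs, a, b, target):
    """Three-register logical OR."""
    regs.write(target, regs.read(a) | regs.read(b))


def op_sub(regs, a, b, target):
    """Three-register subtract (a - b), truncated to 32 bits on write."""
    regs.write(target, regs.read(a) - regs.read(b))


def copy(regs, src, dst):
    """Copy expressed as OR with register zero."""
    op_or(regs, src, 0, dst)


def negate(regs, src, dst):
    """Unary minus expressed as a subtract from register zero."""
    op_sub(regs, 0, src, dst)
```

A comparison whose result is not needed can be handled the same way in
this model, by naming register zero as the target of a subtract.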

Fig. 2. HP Precision Architecture processing resources are
organized around three register arrays and a few specialized
registers.

The space register array contains eight registers. When
one of these is concatenated to a 32-bit address offset, a
virtual address is formed. Three levels of the architecture
are defined, according to the amount and degree of
virtual addressing supported. The level-zero HP Precision
processor does not support any virtual addressing and need
not implement the space registers. When building a
processor for a highly integrated, dedicated system, it is a
considerable savings in hardware cost to eliminate the virtual
address hardware. General-purpose computers, however,
require virtual addressing. A level-one processor supports
16-bit space registers for a 48-bit virtual address space and
a level-two processor implements 32-bit space registers to
allow the full 64-bit virtual address space.
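
The concatenation of a space identifier with a 32-bit offset can be
written out directly. The Python sketch below is only an arithmetic
illustration; the function name and the error handling are assumptions
made for this example.

```python
SPACE_BITS = {1: 16, 2: 32}  # space register width for level-one and level-two


def virtual_address(space_id, offset, level):
    """Concatenate a space identifier with a 32-bit offset."""
    if level not in SPACE_BITS:
        raise ValueError("only level-one and level-two processors form virtual addresses")
    if not 0 <= space_id < (1 << SPACE_BITS[level]):
        raise ValueError("space identifier does not fit the space register")
    if not 0 <= offset < (1 << 32):
        raise ValueError("offset must fit in 32 bits")
    # The space identifier occupies the bits above the 32-bit offset.
    return (space_id << 32) | offset
```

With a 16-bit space identifier the result never exceeds 48 bits, and
with a 32-bit identifier it never exceeds 64 bits, matching the two
address space sizes described above.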

The control register array consists of twenty-five registers 
which contain system state information. Four of these con- 
trol registers are used by the virtual address system to iden- 
tify protection groups for the current process. The shift 
amount for instructions that perform variable-length shifts 
is stored in a control register. An interval timer is included 
as a control register. The configuration of coprocessors in 
a system is also stored in a control register. The remaining 
control registers are used as temporary registers and to 
record the state of the machine at the time of an interrup- 
tion. 

An HP Precision processor also maintains registers for
the current instruction address, the current instruction,
and the processor status word (PSW). The current
instruction address is divided into its virtual space identifier (IAS)
and its offset (IAO) within the space. The instruction
register (IR) contains the current instruction. The PSW holds
various flags for enabling virtual addressing, protection,
interruptions, and other status information.

Fig. 2 shows the processing resources. A complete con- 
text switch only involves the saving of the general registers, 
the space registers, and several of the control registers. The 
instruction address registers and PSW are saved in control 
registers by the hardware at the time of any interruption. 
Since the process state is small and no extra manipulation 
of cache or TLB (translation lookaside buffer) structures 
is necessary, fast context switching is obtained. No
additional resources are needed to save intermediate machine
states, since interruptions are always taken at instruction 
boundaries. 

Data Types 

HP Precision Architecture supports data types for
arithmetic, logical, and field manipulation operations. All data
objects must be stored on their naturally aligned addresses,
that is, 32-bit data objects must start on word-aligned
(four-byte) addresses, 16-bit data objects must start on
half-word-aligned addresses, and 8-bit data objects must start on
byte-aligned addresses. This general alignment rule is easily
obeyed by software and significantly improves the cost and
speed of cache memory hardware. It also eliminates the
possibility of a cache miss or address translation fault in
the middle of a data or instruction reference, thereby
simplifying the processor control.
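
The alignment rule and its consequence can be checked with a few lines
of Python. This is a sketch under assumed names; the block size stands
for any cache line or page size that is a multiple of four bytes.

```python
def is_naturally_aligned(address, size_bytes):
    """True if an object of 1, 2, or 4 bytes starts on a multiple of its size."""
    if size_bytes not in (1, 2, 4):
        raise ValueError("size must be 1, 2, or 4 bytes")
    return (address & (size_bytes - 1)) == 0


def straddles_block(address, size_bytes, block_bytes):
    """True if the object touches two different blocks (cache lines or pages)."""
    first = address // block_bytes
    last = (address + size_bytes - 1) // block_bytes
    return first != last
```

A naturally aligned word never straddles a block boundary, whereas an
unaligned word starting two bytes before a boundary does. That is why
the rule removes the case of a miss or fault in the middle of a single
reference.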

Signed and unsigned integers may be 8, 16, or 32 bits
long. Signed integers are represented in two's complement
form. Characters are 8 bits long and conform to the ASCII
standard. While bits are not directly addressable, efficient
support is provided to manipulate and test individual bits
and bit fields in general registers. Both packed and
unpacked representations of decimal numbers are supported
by software. Packed data is always aligned on a word
boundary and consists of 7, 15, 23, or 31 BCD digits, followed
by a sign digit.

Floating-point numbers are addressed as 32-bit
(single-precision) or 64-bit (double-precision) quantities. The
coprocessor interface allows this wider data path for loading
and storing double-precision floating-point operands. The
floating-point data format conforms to the ANSI/IEEE
754-1985 standard.

Execution Model 

HP Precision Architecture assumes a register-based 
execution model, with all operands coming from registers 
and all results going back into registers. The thirty-two 
general-purpose registers are used for local storage of 
operands, intermediate results, and addresses. 

The execution engine for the basic HP Precision
instruction set consists of a simple arithmetic logic unit (ALU)
and a shift-merge unit (SMU), as shown in Fig. 3. The ALU
has a preshifter on one port and a complementer on the
other port. The SMU consists of a shifter and a mask-merger.
It is used for implementing field manipulation operations.
The shifter concatenates two 32-bit operands and performs
a right shift. The mask-merger selects a contiguous field of
bits from the output of the shifter and merges this with the
other bits from its second input source, forming a 32-bit
result. The second input source to the mask-merger may
be a mask of all zeros or all sign bits, or may come from a
general register.
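
The shifter and mask-merger can be modeled in a few lines of Python to
show how field extraction and deposit fall out of the same two stages.
This is a sketch only: the function names are hypothetical, and bit
positions here are counted from the least significant bit, which is an
assumption of the example rather than the architecture's own bit
numbering.

```python
MASK32 = 0xFFFFFFFF


def shifter(high, low, amount):
    """Concatenate high:low into 64 bits, shift right, keep the low 32 bits."""
    if not 0 <= amount <= 31:
        raise ValueError("shift amount must be 0..31")
    pair = ((high & MASK32) << 32) | (low & MASK32)
    return (pair >> amount) & MASK32


def mask_merge(shifted, background, pos, length):
    """Take `length` bits of `shifted` at bit `pos`; other bits come from `background`."""
    field = ((1 << length) - 1) << pos
    return (shifted & field) | (background & ~field & MASK32)


def extract_unsigned(value, pos, length):
    """Right-justify a field and merge it onto a background of zeros."""
    return mask_merge(shifter(0, value, pos), 0, 0, length)


def extract_signed(value, pos, length):
    """Right-justify a field and merge it onto a background of its sign bit."""
    shifted = shifter(0, value, pos)
    sign = (shifted >> (length - 1)) & 1
    return mask_merge(shifted, MASK32 if sign else 0, 0, length)


def deposit(value, target, pos, length):
    """Place the low `length` bits of `value` into `target` at bit `pos`."""
    # A left shift by pos is a right shift of value:0 by (32 - pos).
    shifted = (value & MASK32) if pos == 0 else shifter(value, 0, 32 - pos)
    return mask_merge(shifted, target, pos, length)
```

Note that the left shift needed by deposit is obtained from the same
right-shifting hardware model, by placing the operand in the upper half
of the concatenated pair.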

The typical execution data flow consists of reading two
operands from general-purpose registers, routing these two
operands through the ALU or the SMU with the proper
function selected, and storing the result back into a general 
register. This is the data flow for the basic three-register 
model of execution, which facilitates single-cycle execu- 
tion, since no memory references are required. 

Single-Cycle Execution 

A primary design goal was that all functional computa- 
tions in the basic instruction set could execute in one 
machine cycle in a pipelined implementation of the proces- 
sor architecture. Operations were selected for inclusion in 
the basic instruction set only if they could be implemented 
in a reasonably small number of logic levels, to guarantee 
a short cycle time. This does not necessarily mean that the 
operation performed had to be primitive in function. In 
fact, rather sophisticated operations were allowed in the
architecture if they proved useful to the compilers, and 
were implementable in a short machine cycle with rela- 
tively simple hardware. 

Fig. 3. The execution data path consists of a simple
arithmetic logic unit (ALU) and a shift-merge unit (SMU).

Complex operations that are necessary to support
required software functions but cannot be implemented in a
single execution cycle are broken down into primitive
operations, each of which can be executed in a single cycle.
Examples are the DECIMAL CORRECT operations, which are
primitive operations for performing arithmetic on BCD
data, the SHIFT AND ADD operations, which are primitives
for integer multiplication, and the DIVIDE STEP operation,
which is a primitive for integer division.
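
To illustrate the idea of building division from a one-bit-per-step
primitive, the Python sketch below uses textbook restoring division.
It is not the definition of the DIVIDE STEP instruction, whose exact
semantics are not given in this article; the function names and the
choice of the restoring algorithm are assumptions for illustration.

```python
MASK32 = 0xFFFFFFFF


def divide_step(remainder, quotient, divisor):
    """One step: shift remainder:quotient left one bit, then try to subtract."""
    remainder = (remainder << 1) | (quotient >> 31)  # bring in next dividend bit
    quotient = (quotient << 1) & MASK32
    if remainder >= divisor:
        remainder -= divisor
        quotient |= 1                                # record a quotient bit of 1
    return remainder, quotient


def divide_unsigned(dividend, divisor):
    """Unsigned 32-bit division as 32 repetitions of the step primitive."""
    if divisor == 0:
        raise ZeroDivisionError("division by zero")
    remainder, quotient = 0, dividend & MASK32
    for _ in range(32):
        remainder, quotient = divide_step(remainder, quotient, divisor)
    return quotient, remainder
```

Each step involves only a shift, a compare, and a subtract, which is
the kind of work that fits in a short machine cycle.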

Single-cycle execution was a design goal of the
architecture, but is not a constraint on the implementations. For
example, an HP Precision microprocessor may operate with
slower memories, performing a load instruction in more
than one cycle.

Immediates

A notable aspect of HP Precision Architecture's
register-based model of execution is its heavy use of the instruction
register as a source for operands, in addition to the thirty-two
general-purpose registers. Many HP Precision instructions
have an immediate field embedded in the 32-bit fixed-length
instruction. These immediates are made maximal-length,
in the sense that they fill up all unassigned bits in
the given instruction. This maximizes the probability that
a constant can be represented in the instruction as
immediate data. Although immediates come in various sizes
in different instruction classes, their sign bit is always in
a fixed position. An immediate operand is advantageous
since it does not have to be loaded to a general register and
therefore saves both a memory access and the use of a
general register.

Although maximal-length immediates in an instruction
are capable of representing most of the constant values that
are needed, it is desirable to have the capability of
embedding full-length 32-bit immediates in the instruction
stream. HP Precision Architecture does this by means of a
pair of instructions. First, a long-immediate instruction is
used to load or add the most significant twenty-one bits of
the immediate value, padded on the right with eleven zeros,
into a general register. A subsequent instruction, using this
register as the base register, supplies the low-order bits to
complete the 32-bit immediate. In this way, a 32-bit
constant value can be placed in a general register, or a load or
store instruction can be performed with a full 32-bit static
displacement. An alternative approach, creating a
double-word instruction, would have introduced the more
complex possibility of a page fault occurring in the middle of
an instruction fetch.
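
The two-instruction construction of a 32-bit constant amounts to
splitting the value into a 21-bit upper part and an 11-bit lower part.
The Python sketch below shows the arithmetic only; the function names
are hypothetical, and the second step is modeled as a plain add of the
low-order bits to the base register.

```python
MASK32 = 0xFFFFFFFF


def split_immediate(value):
    """Split a 32-bit constant into its upper 21 bits and lower 11 bits."""
    value &= MASK32
    return value >> 11, value & 0x7FF


def load_long_immediate(high21):
    """First instruction: place 21 bits on the left, padded with eleven zeros."""
    return (high21 << 11) & MASK32


def build_constant(value):
    """Two-step construction of a full 32-bit constant."""
    high21, low11 = split_immediate(value)
    base = load_long_immediate(high21)  # first instruction
    return (base + low11) & MASK32      # second instruction supplies the low bits
```

Because the eleven padding bits are zero, adding the low-order part can
never carry into the upper 21 bits, so the two halves are independent.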



Load and Store Operations 

The general register array is the only level of the memory
hierarchy that interacts with the execution engine. The
general registers interact with the rest of the memory
hierarchy via the LOAD and STORE instructions.

The LOAD and STORE instructions are designed to execute
in a single cycle in a pipelined implementation of the
architecture that includes a data cache memory that operates
at the speed of the processor. This immediately excludes
the specification of multiple loads and stores or levels of
address indirection in a single instruction.

Even with a fast cache memory, data may not be available
until one cycle after the memory access is initiated.
Therefore, following a load instruction, the software tries to
schedule one or more instructions that do not use the target
register being loaded. However, the hardware must be able
to interlock the pipe if an instruction following a load
instruction uses the target register that has not yet been
loaded.

The size of the data item loaded or stored can be a byte,
a half word, or a full word. It is possible to store any con- 
tiguous sequence of bytes within a word, either starting 
from the leftmost byte or ending with the rightmost byte, 
using the STORE BYTES instruction. For example, it is pos- 
sible to store the leftmost three bytes or the rightmost three 
bytes of a register into three contiguous bytes of memory. 
This instruction is a useful primitive for moving unaligned
strings of bytes from one memory location to another. 

All address calculation in the LOAD and STORE instruc- 
tions is based on the base register plus displacement ad- 
dressing mode. The displacement can be a long 14-bit 
signed displacement, a short 5-bit signed displacement, or 
an index register. An index register, if used, may optionally 
be shifted left by 1, 2, or 3 bits to permit integer addressing
to half words, words, or double words, respectively. Both 
the base register and the index register used in address 
calculation can come from any of the general registers. 
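
The address calculations described above reduce to a base register plus
either a sign-extended displacement or a scaled index. A minimal Python
sketch follows; the function names are hypothetical, and raw
displacement fields are passed as unsigned bit patterns.

```python
MASK32 = 0xFFFFFFFF


def sign_extend(field, bits):
    """Interpret the low `bits` bits of `field` as a two's complement value."""
    sign = 1 << (bits - 1)
    return (field & (sign - 1)) - (field & sign)


def ea_displacement(base, field, bits):
    """Base plus a 14-bit (long) or 5-bit (short) signed displacement."""
    if bits not in (14, 5):
        raise ValueError("displacement is 14 or 5 bits")
    return (base + sign_extend(field, bits)) & MASK32


def ea_indexed(base, index, shift=0):
    """Base plus index, with the index optionally shifted left by 1, 2, or 3."""
    if shift not in (0, 1, 2, 3):
        raise ValueError("index shift must be 0..3")
    return (base + (index << shift)) & MASK32
```

For example, with a shift of 2 an index of 5 selects the sixth word of
an array, since the index is multiplied by the four-byte word size.
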
Flexible Address Modification Mechanisms. Automatic
address modification mechanisms allow one to walk through
a data structure more efficiently, by updating the address
register to the next item in the data structure to be
referenced while fetching the current item.

Flexible address modification mechanisms are included
in HP Precision Architecture, providing high-performance
functionality in a single cycle. For example, it is possible
to modify the base register for a subsequent load or store
instruction by adding to it the long or the short
displacement value specified in the instruction itself, or the value
of an index register, optionally shifted to multiply by the
size of the object to be loaded or stored.

If address modification is specified, either
premodification or postmodification can be performed.
Premodification means that the address calculation is performed and
the result used as the address to initiate the memory access.
Postmodification means that the original content of the
base register is used as the address to initiate the memory
access.

An unusual feature of this premodify or postmodify
addressing mode is that in the long-displacement instructions,
the sign bit of the displacement is also used as the
bit to select premodification or postmodification. This al-

Floating-Point Coprocessor

HP Precision Architecture generally conforms to the concept
of a simple instruction set realizable in cost-effective hardware.
However, certain algorithms like floating-point operations realize
substantial performance gains when implemented on specialized
hardware. The floating-point instruction set is an example of HP
Precision Architecture's instruction extension capabilities.

Floating-point instructions are supported through an assist
coprocessor to provide high-performance numeric processing.
As a coprocessor, the floating-point unit contains its own register
file and executes concurrently with the basic processor.
Operands from the caches are loaded or stored from any of
twelve floating-point registers. The data format, all operations,
and exceptions fully conform to the ANSI/IEEE 754-1985
standard. Very high-performance coprocessors can be implemented
by combining hardware pipelining with the HP Precision
high-level language optimizer.

The floating-point coprocessor is organized like the basic
processor. All operands from main memory are referenced using
coprocessor load and store instructions. Normal virtual address
translation and protection checks are made and data is
transferred between the cache (or memory) and the floating-point
register file. Both single-precision (4-byte) and double-precision
(8-byte) operands can be referenced with a single instruction.
Quad-precision (16-byte) operands are referenced using a pair
of double-precision coprocessor memory reference instructions.

The basic processor performs index and short-displacement
address calculations for the coprocessor load and store
instructions. While store indexed instructions are not provided for the
basic processor, coprocessor store indexed instructions are
provided since only two general register reads and a
nonconflicting coprocessor register read are required.

Floating-Point Register File

The register file contains twelve 64-bit data registers, a 32-bit
status register, and seven 32-bit registers for reporting
exceptional conditions, as shown in Fig. 1. The twelve data registers
also form six 128-bit quad-precision registers. The data registers
are numbered from 4 through 15. Register 0 holds the status
register. When register zero is used as the target or source of a
coprocessor load or store, the status register is referenced. But
when used as the source of an operation, register zero returns
a floating-point zero. This is used for simple assignments,
arithmetic negation, and comparisons with zero.

The status register holds information on the current rounding
mode, the exception flags, and exception trap enables for the
five IEEE exceptions: overflow, underflow, divide by zero, invalid
operation, and inexact. If the exception trap is not enabled, then
a default result is returned and the corresponding exception flag
is set in the status register. If the exception trap is enabled, an
interruption to the main processor occurs, with the exception
and the instruction causing it recorded in an exception register.
On overflow, underflow, and inexact exceptions, the correctly
rounded result is delivered to the destination register. On invalid
operation and divide-by-zero exceptions, the source registers
are preserved. Users can specify a trap handler for any of the
five IEEE exceptions, using the information preserved.

The coprocessor uses an additional nonmaskable exception,
called unimplemented, to pass off to software those operations
not implemented by the coprocessor hardware. The
unimplemented trap triggers a software emulation of the desired
operation with the original operands.

The Boolean result of a floating-point comparison is stored in
a bit in the status word. This bit can conditionally nullify the next
instruction when tested. No conditional branch is allowed. A
conditional branch would have increased the critical path for branch
determination.

Floating-Point Operations 

The floating-point coprocessor defines eleven fundamental
operations in three precisions. All of the operations, except for
conversions to fixed-point formats, produce floating-point results.
Source and destination formats are the same except for
conversions that have explicit source and destination formats. Rounding
is specified by a mode field in the status register. The COPY and
ABSOLUTE VALUE operations are nonarithmetic and do not cause
exceptions. The following table summarizes the defined
arithmetic operations for single, double, and quad formats.

CONVERSION instructions from floating-point formats to
fixed-point formats and between floating-point formats are also
included. When converting from floating-point to fixed-point format,
the current rounding mode can be temporarily changed to
round-to-zero. Many programming languages define conversion to
integer as rounding to zero. In accordance with the standard, the
default rounding mode is rounding to the nearest integer.
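
The difference between the two conversion behaviors is easy to see in
a high-level language. The Python sketch below uses the language's own
built-ins as stand-ins for the two rounding modes; it does not model
the coprocessor's CONVERSION instructions, and the function names are
assumptions. Python's round() resolves ties to the even neighbor, which
is the tie rule of the IEEE round-to-nearest mode.

```python
import math


def convert_round_to_zero(x):
    """Conversion as many languages define it: discard the fraction."""
    return math.trunc(x)


def convert_round_to_nearest(x):
    """Conversion under round-to-nearest, with ties going to the even integer."""
    return round(x)
```

For a value such as -2.7 the two conversions give -2 and -3
respectively, which is why a compiler needs a way to select
round-to-zero for the duration of the conversion.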

Scalability and Performance 

HP Precision Architecture is designed to adhere strictly to the
IEEE floating-point standard. The standard does not, however,
require that all floating-point operations be performed in
high-performance hardware, and does not specify the instruction set level
presentation of the hardware. Whenever there is little
performance advantage to be gained by performing an operation in
hardware, consideration should be given to simplifying the
hardware and performing the operation in software. The
unimplemented exception trap mechanism is employed to avoid
handling operations and exceptional conditions in hardware.

The simplest HP Precision systems may completely exclude
a floating-point unit. Each floating-point instruction causes an
assist emulation trap and system software completely simulates
the function. Special control registers speed the simulation of
load and store instructions. Some implementations can reduce
the complexity of hardware control by supporting only those
operations that use available floating-point hardware. In this case
exceptional conditions arise that can require additional
processing or software assistance. For example, the unimplemented
exception trap mechanism can be used to handle the square root
operation and corner case operands like infinities, NaNs (not a
number), and denormalized numbers.

The floating-point coprocessor is architected to be pipelined
to allow very high-performance numeric processing.
Fundamental to this is the delaying of exception reporting. If the coprocessor
must inform the basic processor immediately that the current
instruction overflows, then little concurrent processing and
pipelining is possible. In HP Precision Architecture, the coprocessor
can freely accept a non-load/store operation independent of any
earlier operations, provided space exists in the exception
registers to report exceptions. This allows seven instructions to be in
execution simultaneously while the basic processor continues.
Load and store instructions to independent data registers can
also be fully overlapped. The coprocessor need only complete
pipelined instructions when the result is being requested.
References to the status register are special and require all operations
to be completed.

A minimally pipelined machine might perform only a single
floating-point operation at a time, but permit load and store
operations to execute concurrently. This requires an interlock
against stores of the single result register specified in the
executing operation, and an interlock on the source registers during
the period that the source exceptions are tested in the operations.
The second interlock may never occur in some implementations.

The floating-point instruction set is designed to allow software
the option of performing pipelined operations without the need
for complex hardware control. The high-level language optimizer
places instructions in a sequence to avoid the most common
interlocks. The use of results is delayed as long as possible and
effective overlap with other integer operations is obtained.



lows the specification of premodification or postmodification
without using up a bit of the long displacement field.
Memory accesses with long displacement fields perform
predecrement or postincrement, depending on the sign of
their displacements. In theory, this is less general than
allowing the specification of premodification or
postmodification to be orthogonal to the sign of the displacement,
as is true for the short-displacement load and store
instructions. In practice, however, the feature works very
well for maintaining stacks stored in the memory. For
example, for a stack growing in the direction of decreasing
memory addresses, pushing onto the stack from a register
is done by a store with predecrement and popping off the
stack is done by a load with postincrement.
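
The stack idiom above can be sketched in a few lines of Python. This is only an illustration, not HP Precision code: the names `modified_access`, `push`, and `pop`, the dictionary standing in for memory, and the 4-byte word size are all assumptions. The one rule taken from the text is that a negative displacement modifies the base before the access and a positive displacement modifies it afterward.

```python
WORD_BYTES = 4  # assumed word size for this sketch


def modified_access(base, displacement):
    """Return (address_used, updated_base) for an access with base modification.

    Negative displacement: premodify (the updated base is the address used).
    Positive displacement: postmodify (the old base is the address used).
    """
    updated = base + displacement
    if displacement < 0:
        return updated, updated
    return base, updated


def push(memory, sp, value):
    """Store with predecrement: the stack grows toward lower addresses."""
    address, sp = modified_access(sp, -WORD_BYTES)
    memory[address] = value
    return sp


def pop(memory, sp):
    """Load with postincrement: read the top word, then release it."""
    address, sp = modified_access(sp, +WORD_BYTES)
    return memory[address], sp


memory = {}                      # hypothetical memory: address -> word
sp = 0x1000                      # hypothetical initial stack pointer
sp = push(memory, sp, 11)
sp = push(memory, sp, 22)
top, sp = pop(memory, sp)        # the most recently pushed word
below, sp = pop(memory, sp)      # the word pushed before it
```

Because the sign of the displacement already selects the modification mode, no separate mode bit is needed in the instruction, which is the encoding saving the paragraph describes.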

Combined Instructions 

The basic types of operations in most instruction sets
fall into three categories: data transformation operations,
data movement operations, and control operations. In
general, one instruction performs one of these operations. A
combined instruction performs more than one of these
operations in one instruction. In HP Precision Architecture,
almost every instruction performs a combination of two of
these operations in a single cycle, with relatively simple
hardware.

HP Precision Architecture has two types of data
transformation and control operation combinations. The first type
has a more general transformation operation combined
with a restricted control operation, whereas the reverse is
true for the second type. Examples of the first type are ADD
instructions that can conditionally skip the execution of
the following instruction. Examples of the second type are
COMPARE AND BRANCH instructions.

The LOAD and STORE instructions combine a data
movement operation (moving data between a general register
and the memory system) with a transformation operation
(the accompanying address calculation and modification).

HP Precision Architecture's combined instructions allow
the execution engine to be used efficiently, since the data
transformation portion of a combined instruction is
performed in the simple execution engine shown in Fig. 3.

Assist Instructions 

The architecture allows for flexible instruction set
extensions by means of assist instructions. Assist instructions
are instructions in which the data movement functions are
defined between the processor or the memory and the assist
hardware, but the data transformation functions are left
unspecified. An extension instruction is defined by specifying
in an assist instruction the data transformation operations
to be performed by the assist hardware. Assist hardware
is optional hardware that accelerates the execution of a set
of assist instructions. In the absence of the assist hardware,
an extension instruction is emulated by software, using
a transparent assist emulation trap mechanism. Critical
information required for emulation is saved in control
registers, substantially reducing the emulation time.

HP Precision Architecture allows up to sixteen assists
in a system configuration, supporting sixteen logically
differentiated sets of instruction set extensions. These are
divided into two generic types of assists: the special function
units (SFUs) and the coprocessors (COPs).

Special function units use the general registers as sources 
and targets of operations. They are coupled very closely to 
the basic processor and its register buses. 

Coprocessors provide functions that use either memory
locations or coprocessor registers as operands and targets
of operations. They are coupled less closely to the basic
processor. Coprocessors may also directly pass doubleword
quantities between the coprocessor and the memory.
This is suited to the manipulation of quantities that are
too large to be handled directly in the general registers.

The HP Precision instruction set can be extended by
defining a set of assist instructions in applications where
specialized hardware is justified by its frequency of use or
by the resulting performance improvement. The architecture
allows such instruction set extensions without
compromising software compatibility. An example of such an
instruction set extension is the instruction set for the
floating-point coprocessor (see box, page 8).

Control Flow Model 

HP Precision Architecture defines a computer in which
the flow of control passes to the next sequential instruction
in the memory unless directed otherwise by branch
instructions, nullification of instructions, or interruptions. These
three mechanisms can potentially alter the sequential flow
of control in instruction processing.

Branching 

The architecture has both unconditional and conditional 
branch instructions. All branch instructions exhibit the 
delayed branch feature. 

In a pipelined processor, it is difficult to execute a branch
instruction in one cycle, since the branch target address
has to be calculated before the target instruction can be
fetched. Hence, taken branches frequently result in pipeline
interlocks, in the absence of other prefetch mechanisms.

To minimize such pipeline interlocks, HP Precision
Architecture defines a one-instruction delayed branch. This
means that a delay instruction, which is the instruction
following a branch instruction, is executed before the
program control flow passes to the target instruction of the
branch. The delay instruction is not executed when it is
explicitly nullified by its preceding branch instruction.
This branch nullification feature is explained later.

The delayed branch mechanism allows compilers to 
schedule a useful instruction in the cycle during which the 
branch target address is calculated. For example, this might 
be an instruction that preceded the branch instruction. 
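
As a rough illustration of the one-instruction delayed branch, the following Python sketch steps through a toy program in which a taken branch redirects fetch only after the instruction that follows it has executed. The tuple encoding of instructions and the function name `run` are hypothetical, nullification is ignored, and a branch placed in a delay slot is not handled.

```python
def run(program):
    """Execute a toy program and return the names of the operations executed.

    program is a list of ("op", name) or ("branch", target_index) tuples.
    Two addresses are tracked: the instruction being executed and the one
    that follows it, so a branch only changes the address after next.
    """
    trace = []
    pc, next_pc = 0, 1
    while pc < len(program):
        kind, arg = program[pc]
        following = next_pc + 1          # default: sequential flow
        if kind == "branch":
            following = arg              # takes effect after the delay instruction
        else:
            trace.append(arg)
        pc, next_pc = next_pc, following
    return trace


program = [
    ("op", "A"),
    ("branch", 4),        # unconditional branch to index 4
    ("op", "DELAY"),      # delay instruction: still executed
    ("op", "SKIPPED"),    # never reached
    ("op", "TARGET"),
]
executed = run(program)
```

The delay instruction runs even though it sits after the branch in memory, which is why a compiler can move useful work into that position.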

Unconditional Branching. HP Precision Architecture
defines local branches, where the control flow passes to
another location within the current virtual space, and
external branches, where instruction processing continues at a
location that may be in a different virtual space.

The design of high-speed pipelines is simplified if branch
target address calculations can be made before the
execution of the branch instruction itself. In HP Precision
Architecture, the most common branch instructions have
branch targets calculated relative to the address of the
branch instruction itself, with displacements given in the
branch instruction. These are called relative branches with
static displacements. Unconditional branch instructions
have a 17-bit signed displacement field, and the conditional
branches have a 12-bit signed displacement field.
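
To make the two field widths concrete, here is a small Python sketch that computes the range of a signed field and sign-extends a raw field value. The helper names are hypothetical, and the ranges are expressed in units of the displacement field itself, because this passage does not state how the field is scaled into a byte address.

```python
def signed_field_range(bits):
    """Smallest and largest values representable in a two's-complement field."""
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1


def sign_extend(field, bits):
    """Interpret the low `bits` bits of `field` as a two's-complement number."""
    sign_bit = 1 << (bits - 1)
    return (field & (sign_bit - 1)) - (field & sign_bit)


unconditional_range = signed_field_range(17)   # 17-bit field of unconditional branches
conditional_range = signed_field_range(12)     # 12-bit field of conditional branches
backward_one = sign_extend(0x1FFFF, 17)        # all ones in 17 bits
```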

Although a 17-bit displacement will cover almost all
branch distances, it is insufficient in certain situations.
Furthermore, it is not always possible or convenient to
generate a static displacement at compile time for some
branches. Hence, the architecture includes branch
instructions with 32-bit dynamic displacements specified by the
contents of a general register.

Branches are also needed to locations that have no
relation to the address of the branch instruction, for example,
to independent relocatable modules. This is called absolute
branching, since the address of the target instruction can
be anywhere in the address space. HP Precision Architecture
also allows absolute branches: the branch displacement
is added to the contents of a general register called
the base register.

Subroutine Calls. The subroutine call primitives are
BRANCH AND LINK instructions, which save the return
address of the calling routine in a general register before
transferring the control flow to the subroutine. Both local
(intraspace) and external (interspace) subroutine calls are
defined. The external subroutine calls must save a larger
return pointer, indicating also the virtual space of the caller.

The external BRANCH AND LINK instruction uses implicit
link registers for saving both the caller's space identifier
and the offset within that space. Space register zero (SR 0)
is used for saving the space identifier and general register
thirty-one (GR 31) is used for saving the offset address.
This permits the maximum number of bits to be used for
encoding the static branch displacement.

Subroutine returns are accomplished by using an
absolute branch instruction, specifying the general register used
to save the link address in the BRANCH AND LINK calling
instruction. If appropriate software conventions are used,
a uniform subroutine return sequence can be used for both
local and external calls.

Inter-Ring Branches. Four hierarchical protection rings are
implemented in HP Precision Architecture. Each ring has
a privilege level associated with it, the innermost ring
(privilege level 0) being the most privileged ring and the
outermost ring (privilege level 3) being the least privileged
ring.

The architecture defines unconditional branch
instructions that perform inter-ring crossings in one instruction.
Three of these are outward branches, causing a decrease
in the process privilege level. Only one branch instruction
(GATEWAY) is an inward branch, causing an increase in
privilege level.

Conditional Branching. In many architectures, conditional
branching is accomplished by two separate instructions.
The first instruction calculates a condition, and saves the
result of this condition calculation in state flip-flops in the
processor called a condition code. A subsequent
conditional branch instruction may alter the program's control
flow depending on the value of the condition code.

Statistics of instruction sequences show that in an
overwhelming majority of cases, a conditional branch
instruction is immediately preceded by the instruction that sets
the condition tested by the branch. HP Precision Architecture
capitalizes on that fact by combining the two instructions
into one instruction, thus achieving code compaction,
reduction of execution time, and elimination of condition
code flip-flops in the processor state. Each conditional
branch instruction includes a data transformation
operation, which generates a condition that is used immediately
to determine whether the branch is taken or not. Such
conditional branch instructions also provide greater
opportunities for an optimizer to reorder instructions, with less
bookkeeping.

There are four kinds of operations that can be executed
with a conditional branch instruction. The ADD AND BRANCH
instruction is useful for closing loops. The COMPARE AND
BRANCH instruction is useful for closing loops and for
if-then-else control structures. The BRANCH ON BIT instruction
allows branching on the value of any bit in a general
register. The MOVE AND BRANCH instruction is useful for
reinitializing a register before branching away.

HP Precision Architecture also implements a special
nullification scheme to optimize the use of the delay
instruction following a conditional branch instruction.

Nullification

HP Precision Architecture defines a control flow feature
called the nullification of the immediately following
instruction. When an instruction is nullified, it executes as
a no-operation (NOP), and the effect is as if it had never
been in the instruction stream. This means that no change
in any architecturally visible state, like general registers,
memory, control registers, or space registers, occurs because
of a nullified instruction. A nullified instruction does not
cause any traps to be generated, and it does not cause its
successor instruction to be nullified. All branch instructions
and data transformation instructions have the ability
to nullify the instruction to be executed next.

All branch instructions have a single-bit nullification
field. An unconditional branch instruction can "always
nullify" or "never nullify" the execution of its delay
instruction by setting the value of the nullification field to
one or zero, respectively. A conditional branch instruction
can "conditionally nullify" or "never nullify" the execution
of its delay instruction in the same manner. The never
nullify feature is used whenever a delay instruction can
be found that can always be executed, regardless of whether
the branch is taken or not.

A conditional branch is taken when the condition it
specifies evaluates true. To optimize the use of the delay
instruction following the conditional branch, the delay
instruction is nullified for backward branches only if the
condition is false, and for forward branches only if the
condition is true. Since the compilers use the convention
that loops are closed with backward branches, the delay
instruction of this branch can now be "inside" the loop,
saving a cycle on each iteration. The following example
illustrates this.

            XXX
    LOOPB:  YYY
             .
             .            (loop body)
            ZZZ
            COMBT,C,n   r1,r2,LOOPB
            XXX
            WWW

As shown, the first instruction (XXX) of the loop body
can always be duplicated following the loop-closing
branch, COMBT. When the COMBT instruction is executed,
if condition C is true, then the XXX instruction is executed
and control passes back to LOOPB. Otherwise, the next
instruction (XXX) is nullified and processing continues with
instruction WWW.

For forward branches, the nullification definition allows
shorter code sequences for if-then constructs, as shown in
the following example.

    "If" Code       RRR
                    SSS
                    COMBT,C,n   rx,ry,THRU
    "Then" Code     TTT
                    UUU
            THRU:   VVV

When the conditional branch instruction, COMBT, is
executed, if condition C is true, the next instruction is nullified
and the branch is taken around the "Then" code to the
location THRU. Otherwise, the next instruction (TTT) is
executed.
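
The nullification rules for the delay instruction can be summarized as a small decision function. The Python sketch below is an illustration only: the function name and its boolean parameters are hypothetical, and it encodes nothing beyond the rules stated in the text (nullification field zero means never nullify; an unconditional branch with the field set always nullifies; a conditional branch with the field set nullifies on a false condition when backward and on a true condition when forward).

```python
def delay_instruction_nullified(nullify_field, conditional, backward, condition_true):
    """Decide whether the instruction after a branch is nullified."""
    if not nullify_field:
        return False                 # "never nullify"
    if not conditional:
        return True                  # unconditional branch: "always nullify"
    if backward:
        return not condition_true    # loop-closing branch falls through: drop the copy
    return condition_true            # forward branch taken: skip the next instruction


# Loop-closing COMBT with the nullification field set (backward branch).
loop_continues = delay_instruction_nullified(True, True, True, True)
loop_exits = delay_instruction_nullified(True, True, True, False)

# Forward COMBT around the "Then" code.
then_skipped = delay_instruction_nullified(True, True, False, True)
then_entered = delay_instruction_nullified(True, True, False, False)
```

In both cases the delay instruction executes exactly when it is useful: on the taken path of a loop, and on the fall-through path of an if-then construct.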

Every data transformation instruction has an implicit
conditional skip operation built into it. In a single cycle,
the function specified by the transformation instruction is
performed by the execution engine, and a condition
specified in the instruction is evaluated. If the condition
evaluates true, then the next instruction to be executed is
nullified. If the condition evaluates false, then the next
instruction is executed, or not nullified.

The following example shows the use of nullification in
an ALU instruction to implement a compact control
sequence for a high-level language construct.

High-level language:

    if (a < b) then b = b + 1;

Equivalent HP Precision assembly language:

    SUB,>=  a,b,r0    ; Subtract [GR b] from [GR a], discarding
                      ; the result, and nullify next instruction if
                      ; [GR a] >= [GR b].

    ADDI    1,b,b     ; Add the immediate value 1 to [GR b],
                      ; writing the result back to GR b.
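
The two-instruction sequence can be mimicked in Python to check that it matches the high-level statement. This is a behavioral sketch, not an emulator: the function names are hypothetical, registers are plain Python integers, and overflow is ignored.

```python
def sub_nullify_if_ge(a, b):
    """Model of the SUB with a >= condition and target r0.

    The difference is computed and discarded; the return value says
    whether the next instruction is nullified.
    """
    _discarded = a - b
    return a >= b


def increment_if_less(a, b):
    """Model of the SUB followed by the add-immediate: returns the final b."""
    nullify_next = sub_nullify_if_ge(a, b)
    if not nullify_next:
        b = b + 1            # add the immediate value 1 to b
    return b


less = increment_if_less(1, 5)      # a < b: the add executes
equal = increment_if_less(5, 5)     # a == b: the add is nullified
greater = increment_if_less(7, 2)   # a > b: the add is nullified
```

Note that the condition in the instruction is the complement of the source-level test, because a true condition suppresses the following instruction.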

Conditional Trap. In some instructions, the condition
specified in the instruction is used to cause a conditional
trap, rather than the nullification of the next instruction.
An advantage of taking a conditional trap rather than
conditionally nullifying a branch to a trap routine is that the
majority of instructions do not incur the penalty of a
nullified instruction. For example, when an add or subtract
instruction is used to perform range checking, the penalty
of a conditional trap is taken only in the rare cases where
the range check fails.
While it is a common feature of other architectures to
have an ALU instruction trap on arithmetic error conditions
like overflow, it is a special feature of HP Precision
Architecture to allow trapping on defined conditions that are
not arithmetic errors.

Assist Nullification. In assist nullification, the condition
upon which nullification is performed is generated by the
assist hardware rather than by the basic processor. Instead
of defining assist branch instructions, the processor's
unconditional branch instructions are used for control flow
changes in assist programs. The equivalent of conditional
branching is achieved using a pair of instructions: a data
transformation assist instruction with its nullification field
set to one, followed by an unconditional branch
instruction. The assist instruction generates a condition that
determines whether the following branch instruction should be
nullified.

An assist can be defined with the nullification
operation dependent upon the condition generated either in
the current assist instruction or in the previous assist
instruction. The latter is called delayed nullification.
Delayed nullification allows other instructions, executed
by the basic processor or other assists, to be scheduled
during the time the assist hardware is performing a lengthy
computation that generates the condition for determining
nullification.

Interruptions 

Interruptions are anomalies that occur during instruction
processing, causing the control flow to be passed to an
interruption handling routine. In the process, certain
processor state saves and changes are made automatically by
the hardware. Upon completion of interruption processing,
a RETURN FROM INTERRUPT instruction is executed, which
restores the saved processor state, and execution proceeds
with the interrupted instruction.

Traps, faults, checks, and interrupts are different
anomalies that may happen during instruction processing on a
computer. In HP Precision Architecture, they are all
handled by the same basic mechanism. The term interruptions
is used in discussing these anomalies as a group.

The architecture implements a single-level interruption
system. This means that once an interruption is chosen for
service, it cannot be preempted for service by a
higher-priority interruption. It also implies that only one
interruption is serviced at a time. If an instruction raises multiple
interruptions, the highest-priority interruption is serviced,
and then the instruction is reexecuted, which causes the
other interruptions to be raised again. Then the next
highest-priority interruption is serviced, and so on.
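
The service-then-reexecute cycle can be traced with a short Python sketch. Everything named here is hypothetical: the interruption names, the numeric priority ranks (0 meaning highest), and the assumption that each handler removes the cause of the interruption it services.

```python
def service_interruptions(raised, priority):
    """Return (service_order, executions) for one instruction.

    raised   -- names of the interruptions the instruction raises at first
    priority -- mapping of name to rank, where 0 is the highest priority
    """
    outstanding = set(raised)
    serviced = []
    executions = 0
    while True:
        executions += 1                       # execute or reexecute the instruction
        if not outstanding:
            return serviced, executions       # nothing raised: it completes
        winner = min(outstanding, key=lambda name: priority[name])
        serviced.append(winner)               # only one interruption is serviced
        outstanding.discard(winner)           # assumed: handler removes the cause


priority = {"high": 0, "mid": 1, "low": 2}    # hypothetical ranks
order, executions = service_interruptions({"low", "high", "mid"}, priority)
```

With three interruptions raised, the instruction is attempted four times: once for each interruption serviced and once more to complete.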

The nesting of interruptions is not excluded, since the
interruption handling routine can choose to reenable other
interruptions once it has saved the appropriate state. Since
the machine state is saved in registers rather than in
memory when an interruption is serviced, interruption handlers
must leave interruptions disabled until they have saved
the machine state in memory.

In certain pipelined processors, interruptions are often
not precise, in the sense that they may not be serviced
immediately after the instruction that caused the
interruption. This is because in overlapped instruction processing,
several successive instructions may already have been
partially or fully processed by the time the interruption caused
by an instruction is generated. This imprecision adds
considerable complexity to interrupt handling routines.

In a nonoverlapped processor, precise interruptions are
easy to implement, since an instruction is fetched and com- 
pletely executed before the next instruction is fetched. 
Hence, interruptions can be serviced between instructions, 
that is, after the instruction causing the interruption and 
before the next instruction's processing starts. 

HP Precision Architecture requires that interruption
servicing appear the same for both overlapped and
nonoverlapped processors. Hence, all implementations must
provide precise interruptions, and resume execution at the
same instruction as a nonoverlapped implementation.

Traps and Faults. Traps and faults are synchronous
interruptions, meaning that they are caused by the processing
of an instruction or a sequence of instructions. A trap
occurs when the function requested by the current instruction
cannot or should not be carried out, or system intervention
is desired by the user before or after the instruction is
executed. A fault occurs when the current instruction
requests a legitimate action that cannot be carried out because
of a system problem, such as the absence of a page from
main memory. After the system problem has been corrected,
the faulting instruction will execute normally.

In HP Precision Architecture, the overflow trap and the
conditional trap occur for arithmetic instructions. The 
privileged operation or privileged register traps occur when 
certain system management instructions or control regis- 
ters are accessed by a process with insufficient privilege. 
An illegal instruction trap is generated for undefined oper- 
ation codes, or illegal instruction sequences which could 
otherwise cause security breaches. The assist exception 
and emulation traps allow assist hardware to request the 
processor to service assist-generated traps, or to emulate 
assist instructions not supported by hardware. 

Virtual memory faults and traps may also be generated 
for instruction fetches or data fetches in virtual mode. For 
example, if the virtual-to-physical address translation is 
not found in the hardware translation lookaside buffer, a 
TLB miss fault is generated. If a virtual memory access fails
the protection checking required for the access, then a
memory protection trap is generated. These traps are
generated independently for instruction and data virtual
accesses. The first time a page is written, a TLB dirty-bit trap
occurs, which is used by the system to distinguish
unmodified pages from modified (dirty) pages at page replacement
time.

HP Precision Architecture also has a rich set of debugging
support traps. A BREAK instruction is defined in the
architecture to allow the insertion of software breakpoints.
Whenever such an instruction is executed, a break trap
occurs. Any store instruction to a virtual address may also
generate a data memory break trap, if this trap is enabled
by a bit in the TLB entry. This allows the tracing of all data
updates to a given page. A similar facility traps on any
reference whatsoever to a given virtual page. Traps may
also be generated, if enabled, after a branch is taken, or
when the privilege level of the running process is promoted
or demoted. Architectural support for software rollback
schemes is also implemented by means of a recovery
counter trap. A 32-bit control register, the recovery counter,
can be initialized to any integer value. If enabled, the
counter is decremented for every nonnullified instruction
that is executed, and a recovery counter trap is generated
when a zero value is reached. The recovery counter can be
used in fault recovery, to permit an exact reexecution of
the instruction stream since the last checkpoint.

Checks and Interrupts. A check occurs when a hardware
malfunction is detected. Depending on the nature of the
malfunction, checks may be synchronous or asynchronous
with respect to the instruction stream. HP Precision
Architecture defines two types of machine checks: a
high-priority machine check and a low-priority machine check.

An interrupt occurs when an external entity, like an I/O
device or the power supply, requires attention. Interrupts
are asynchronous with respect to the instruction stream.

There are thirty-two external interrupt classes, each of
which can be individually masked by privileged software.
The architecture defines two control registers specifically
for handling these external interrupts. The external
interrupt request (EIR) register and the external interrupt enable
mask (EIEM) register each have thirty-two bits, one for each
external interrupt class. A privileged instruction allows
the writing of any set of mask bits to the EIEM register and
the clearing of any selected bits in the EIR register. When
an external interrupt of any class occurs, its corresponding
interrupt pending bit is set in the EIR register. If the
corresponding mask bit in the EIEM register is also one, then
an external interrupt is taken. An EIR register bit remains
set, leaving the external interrupt pending, until explicitly
reset by an interruption handler.
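
The interaction of the two registers reduces to a few bitwise operations, sketched below in Python. The class name, the method names, and the mapping of interrupt class n to bit n counted from the least significant end are assumptions made for illustration; the text specifies only that each register has one bit per class.

```python
ALL_BITS = 0xFFFFFFFF  # both registers are thirty-two bits wide


def class_bit(n):
    """Bit for external interrupt class n (assumed numbering, see lead-in)."""
    if not 0 <= n < 32:
        raise ValueError("there are thirty-two external interrupt classes")
    return 1 << n


class ExternalInterruptRegisters:
    def __init__(self):
        self.eir = 0    # pending requests
        self.eiem = 0   # enable mask

    def signal(self, n):
        """An external interrupt of class n occurs: set its pending bit."""
        self.eir |= class_bit(n)

    def write_mask(self, mask):
        """Privileged write of any set of mask bits to the EIEM register."""
        self.eiem = mask & ALL_BITS

    def clear_requests(self, bits):
        """Privileged clearing of selected bits in the EIR register."""
        self.eir &= ~bits & ALL_BITS

    def interrupt_taken(self):
        """True when some pending class is also enabled in the mask."""
        return (self.eir & self.eiem) != 0


regs = ExternalInterruptRegisters()
regs.signal(3)
masked = regs.interrupt_taken()            # class 3 pending but not enabled
regs.write_mask(class_bit(3))
enabled = regs.interrupt_taken()           # pending and enabled
regs.clear_requests(class_bit(3))
after_clear = regs.interrupt_taken()       # handler reset the pending bit
```

A masked request is not lost: its bit stays set in the request register and the interrupt is taken as soon as the mask enables it.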

Relative priority of these thirty-two external interrupt
classes is not assigned by the architecture or by the
hardware. When multiple unmasked external interrupts occur
simultaneously, or when there are multiple external
interrupts pending in the EIR register, the external interrupt
handler selects the order of service.

Interruption Parameters and Servicing. Six control
registers are defined to save interruption parameters and
expedite the processing of interruptions. The collection of
information in these interruption parameter registers occurs
only when the interruption state collection enable flag (Q
bit) in the processor status word (PSW) is set.

These interruption parameter registers save the processor 
status word of the interrupted process, the instruction that 
is interrupted, and the data address (space and offset por- 
tions) for memory reference instructions. Two other register 
pairs form two queues, saving the space and offset portions 
of the addresses of the first two instructions to be processed 
upon returning from the interruption. 

The two queues are necessary because in an architecture
with delayed branching, at least two return addresses must
be saved before jumping to the interruption handler. Two
are necessary because the last instruction to be completed
before the interruption may be a taken branch. In this case
the next two instructions to be executed may not be
contiguous, since one is the delay instruction and the other is the
target instruction. These queues are constantly updated by
the hardware whenever interruption parameter collection
is enabled. When an interruption is taken, the queues and
other interruption parameters are preserved by disabling
further interruption collection.

Interruption servicing is implemented as a fast context
switch, which is much simpler than a complete process
swap. When an interruption occurs, the current processor
status, represented by the PSW, is saved. Then, the PSW
is cleared to zeros to disable further interruptions, to enable
real-mode addressing, and to freeze the information
collected in the interruption parameter registers. The current
privilege level is set to the highest privilege level. The
control flow then passes to a vectored location in an
interrupt vector table, which is dynamically relocatable. This
simple set of architecturally defined operations facilitates
a fast and uniform switch to interruption servicing for all
implementations.

Addressing and Protection Model 

HP Precision processors access memory using byte
addresses. Larger addressable units include halfwords,
words, and doublewords. An address is either physical or
virtual. All load and store instructions can be used in either
virtual or physical mode. Virtual mode is enabled
separately for instruction fetches and data accesses by two flags
in the processor status word.

A pointer to physical memory is a 32-bit unsigned integer 
whose value is the address of the first byte of the operand
it designates. Physical addresses are used directly, with no 
protection or access rights checking performed. Virtual ad- 
dresses are translated to physical addresses and undergo 
protection and access rights checking as part of the trans- 
lation. This allows the hardware support for access control 
to be built into the storage unit. 

The input/output (I/O) architecture is memory mapped. 
That is, complete control of all system components (of 
which I/O attachments are a special case) is exercised by 
the execution of load and store instructions to virtual or 
physical addresses. This approach permits I/O drivers to 
be written in high-level languages. Furthermore, since the 
usual page-level protection mechanism is applied during 
virtual-to-physical address translation, user programs can
be granted direct control over particular I/O devices
without compromising system integrity.

Virtual Memory Addressing 

A virtual address is defined globally and has the same 
meaning when used by any process. This is in contrast to
other architectures, which permit use of the same address 
for different objects by different processes. The virtual ad- 
dress space is so large that processes can be assigned sepa- 
rate address ranges for private data. Address translation 
information does not need to change upon a process switch 
and the information needed for address translation can be 
represented more compactly. Global virtual addressing 
therefore allows closely coupled processes to accumulate 
a stable working set of address translations in spite of fre- 
quent process switching. 

Virtual memory is structured as a set of address spaces,
each containing 2^32 bytes. A level-one processor
implements 2^16 spaces (16-bit space registers), and a level-two
processor implements 2^32 spaces (32-bit space registers).
A space is specified by a space identifier, and is divided
into pages, each 2048 bytes in length.

For a level-two processor, the concatenation of a 32-bit
space identifier and a 32-bit offset within the space forms
a virtual address. Alternatively, a virtual address may be
viewed as the concatenation of a 53-bit virtual page number
and an 11-bit offset within the page.
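
The two views of a level-two virtual address are the same 64 bits split at different points, as this Python sketch shows. The function names and the sample space identifier are hypothetical; the field widths (32 + 32 and 53 + 11) and the 2048-byte page size come from the text.

```python
SPACE_BITS = 32
OFFSET_BITS = 32
PAGE_OFFSET_BITS = 11                      # 2048-byte pages: 2**11 == 2048
PAGE_NUMBER_BITS = SPACE_BITS + OFFSET_BITS - PAGE_OFFSET_BITS


def make_virtual_address(space_id, offset):
    """Concatenate a 32-bit space identifier and a 32-bit offset."""
    assert 0 <= space_id < (1 << SPACE_BITS)
    assert 0 <= offset < (1 << OFFSET_BITS)
    return (space_id << OFFSET_BITS) | offset


def split_into_page(virtual_address):
    """Return (virtual_page_number, offset_within_page)."""
    page_number = virtual_address >> PAGE_OFFSET_BITS
    page_offset = virtual_address & ((1 << PAGE_OFFSET_BITS) - 1)
    return page_number, page_offset


va = make_virtual_address(5, 0x1004)       # hypothetical space 5, offset 0x1004
page_number, page_offset = split_into_page(va)
```

Offset 0x1004 lies four bytes into the third 2048-byte page of the space, so the page number is the space identifier shifted left by 21 bits plus 2.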

For virtual addressing, space identifiers are specified in
space addressing registers. These include the space portion
of the instruction address register and the eight space
registers SR 0 through SR 7 (see Fig. 4). One such register is
implicitly or explicitly selected by every instruction that
generates a virtual address.

SR 0 is used as an implied target by the interspace
procedure call instruction. SR 1 through SR 7 have no
architecturally defined functions, but it is expected that their use
will be constrained by the following software conventions.
SR 1 through SR 3 are used as scratch registers for the
manipulation of 64-bit virtual pointers. SR 4 tracks the
current program's space and provides access to literal data
contained in the current code space. SR 5 points to a space
containing process private data, SR 6 to a space containing
data shared by a group of processes, and SR 7 to a space
containing the operating system's code, literals, and data.
The conventions for SR 4 through SR 7 were chosen to
permit use of 32-bit virtual address pointers (see below)
for almost all data references.

SR 5 through SR 7 can be modified only by code
executing at the most privileged level. SR 0 through SR 4 can be
changed by an unprivileged user. Shared libraries or
subsystems will be assigned individual code spaces, and
branching into those other spaces will involve changing
SR 4.

Instruction and Data Addressing. Instruction addresses are
computed for instruction fetch, instruction cache flush
instructions, instruction TLB instructions, and branch target
calculations. Instructions that explicitly reference a space
register use the 3-bit S field, located in the instruction, to
designate one of the eight space registers.

Data addresses are computed for load, store, semaphore,
probe, data cache, and data TLB instructions. Data addresses
specify one of the eight space registers in an interesting
way: only a 2-bit S field in the instruction is used. When
the 2-bit S field is nonzero, it selects the corresponding
space register 1, 2, or 3. When the S field is zero, the space
register is designated by adding four to the two high-order
bits of the base register specified in the instruction. This
allows the selection of space registers 4 through 7.

Fig. 4. Space register conventions.

Data references with the S field equal to zero allow
addressing of four distinct spaces selected by the high-order
bits of a 32-bit pointer. This is called short-pointer
addressing (Fig. 5), since a 32-bit value both specifies an offset
and selects a space register. Only one fourth of each space
is directly addressable with short pointers. This region
corresponds to the quadrant selected by the upper two bits. For
example, if a base register contains the hex value 80001000,
the content of space register 6 is the space identifier and
the third quadrant of the space is directly addressable.
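
The selection rule for data references can be sketched as
follows. This is an illustrative model only; the function name
is hypothetical, and the rule itself is the one stated above.

```python
def select_space_register(s_field, base_value):
    """Return the space register number used by a data reference.

    s_field    -- the 2-bit S field from the instruction (0..3)
    base_value -- the 32-bit content of the base register
    """
    assert 0 <= s_field <= 3
    if s_field != 0:
        return s_field                      # SR 1, SR 2, or SR 3
    quadrant = (base_value >> 30) & 0b11    # two high-order bits
    return 4 + quadrant                     # SR 4 through SR 7

# The example from the text: base register = hex 80001000
print(select_space_register(0, 0x80001000))   # prints 6
```

With the S field equal to zero, the same two bits that pick the
space register also pick the quadrant of that space, which is why
only one fourth of each space is reachable with a short pointer.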

Short-pointer addressing allows the pointer data type of
conventional languages to be 32 bits in length. Therefore,
such pointers can be handled efficiently in the general-purpose
registers. Also, pointers are the same length as the
standard integer data type, a situation assumed by a number
of existing high-level language programs. Long pointers are
48 bits or 64 bits in length, consisting of a 16-bit or 32-bit
space identifier together with a 32-bit byte offset within
the space, for level-one and level-two processors,
respectively.

Software Virtual Address Translation 

TLBs (see box, page 16) do not contain the translations
for all pages in memory simultaneously. When they do not
have the desired translation, a TLB miss occurs. In many
architectures, TLB misses are handled in microcode. In HP
Precision Architecture, they may be handled in software.
When a TLB miss is detected, the hardware does not have
sufficient information to complete the instruction being
executed. Instead, an interruption is generated to invoke
the appropriate TLB miss handler. One miss handler handles
misses during instruction fetch, and another handles misses
during data access. The virtual address causing the miss is
directly available to the TLB miss handler in interruption
parameter control registers to expedite miss handling.

Because of the critical effect on system performance of 
the speed of address translation, all information required 
to translate the virtual address of a page that is actually 
present in physical memory must be permanently resident 
in memory. Because of the size of the virtual address space, 
tables describing all virtual pages cannot be kept
permanently in memory. Thus the data structures used to
translate valid virtual addresses (no page fault) describe only
physically present pages and have a size proportional to
the size of physical memory, consuming less than 2% of
the available memory. The information represents a one-to- 
one mapping between physical and virtual pages. Thus it 
cannot support memory aliasing (see box, page 16) or pro- 
cess-specific address translation. A desire to use these ef- 
ficient structures was an important motivation for disallow- 
ing both features. 

This address translation information resides in a physical
page directory (PDIR). The physical-to-virtual address
translation is obtained by using the physical address as a
direct index into the PDIR. The translation of a virtual
address to a physical address is accomplished using two
tables, the hash table and the PDIR. Each table is located
by a pointer which defines its absolute starting address.
For efficiency, these pointers are kept in control registers
(assumed to be CR 24 for the PDIR address and CR 25 for
the hash table address).

The purpose of the hash algorithm is to map virtual ad- 
dresses to a smaller, denser name space. The number of 
entries in the hash table is typically a multiple of the 
number of entries in the PDIR, rounded up to the nearest 
power of two. Since multiple virtual addresses can map 
into the same hash table entry, they are linked together as 
a chain of PDIR entries. The TLB miss handler hashes the 
virtual address, looks up the start of the chain in the hash 
table, and looks through the chain in the PDIR until it finds 
either the match or the end of the chain. If it finds the 
match, it puts the information from the PDIR into the TLB 
and retries the instruction. If it finds the end of the chain, 
the page is not in memory and a page fault is signaled by
the software. 
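
The lookup performed by the TLB miss handler can be sketched
as below. This is a minimal model under stated assumptions: the
hash function, the entry layout, and all names are hypothetical
(the article does not specify them), and the PDIR is modeled as
a list indexed by physical page number.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PdirEntry:
    space_id: int
    vpn: int                    # virtual page number within the space
    next_index: Optional[int]   # next PDIR entry on the same hash chain

class PageFault(Exception):
    pass

def hash_vpage(space_id, vpn, table_size):
    # Hypothetical hash; table_size is assumed to be a power of two.
    return (space_id ^ vpn) & (table_size - 1)

def map_page(space_id, vpn, ppn, hash_table, pdir):
    """Record that physical page ppn holds (space_id, vpn)."""
    slot = hash_vpage(space_id, vpn, len(hash_table))
    pdir[ppn] = PdirEntry(space_id, vpn, hash_table[slot])
    hash_table[slot] = ppn

def tlb_miss_lookup(space_id, vpn, hash_table, pdir):
    """Walk the hash chain; return the physical page or raise PageFault."""
    index = hash_table[hash_vpage(space_id, vpn, len(hash_table))]
    while index is not None:
        entry = pdir[index]
        if entry.space_id == space_id and entry.vpn == vpn:
            return index        # the PDIR index is the physical page
        index = entry.next_index
    raise PageFault((space_id, vpn))

pdir = [None] * 4
hash_table = [None] * 8
map_page(5, 0x10, 2, hash_table, pdir)
map_page(5, 0x18, 3, hash_table, pdir)   # hashes to the same slot
```

Because the chain links live in the PDIR itself, the total table
size grows with physical memory rather than with the size of the
virtual address space.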

The physical page directory (PDIR) contains one entry 
for each page of physical memory, plus one for each phys- 
ical or virtual I/O device. The entries for physical pages 
are at nonnegative offsets from the location pointed to by 
CR 24, and the I/O entries are at negative offsets. This 
arrangement corresponds to the layout of the 32-bit phys- 
ical address space which places physical memory at the 
lower end of the space and memory mapped I/O devices 
at the upper end. 

The design of the hash table and PDIR is such that later
implementations can service TLB misses in hardware, with 
a reduction in the time spent servicing TLB misses. Control 
registers have been reserved to contain the hash table ad- 
dress and PDIR address. 

Paging Management. One function of an operating system
is to swap out pages that have not been accessed recently,
to make room for pages being accessed that are still on
disc. To help implement this, there is a reference bit for
each page, within the PDIR entry, even though there is no
hardware bit corresponding to it in the TLB. Instead, the
entry is only allowed to be in the TLB if the reference bit
is set. When the reference bit is cleared, the TLB entry is
also purged by software. The next time there is a TLB miss,
the miss handler will also set the reference bit in the PDIR.
Thus, the operating system can clear the reference bit, and
if the bit is still clear sometime later when it examines it
again, it knows that the page has not been accessed in the
meantime.

Each entry of the PDIR (and the TLB) has a dirty bit that 
tells whether the page has been modified since it was 
brought in from disc. When the page is first brought in, 
the dirty bit is clear. As long as only reads are done to the 
page, the bit will remain clear. However, the first time a
program tries to store data to that page, the TLB causes a
dirty bit update trap, which sets the bit to one in both the
PDIR and the TLB. This provides information to the operating
system so that it can avoid writing out unmodified
pages, since the copy on disc is still valid. 
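
The interplay of the reference and dirty bits described above
can be modeled in a few lines. The class and function names are
hypothetical; the sketch only mirrors the rules stated in the
text and is not a description of any particular implementation.

```python
class Page:
    def __init__(self):
        self.referenced = False   # reference bit in the PDIR entry
        self.dirty = False        # dirty bit in the PDIR entry
        self.in_tlb = False       # whether a TLB entry currently exists
        self.tlb_dirty = False    # dirty bit copy held in the TLB entry

def access(page, is_store):
    if not page.in_tlb:
        # TLB miss handler: insert the entry and set the reference bit.
        page.referenced = True
        page.in_tlb = True
        page.tlb_dirty = page.dirty
    if is_store and not page.tlb_dirty:
        # Dirty bit update trap: set the bit in both PDIR and TLB.
        page.dirty = True
        page.tlb_dirty = True

def clear_reference(page):
    # Clearing the reference bit also purges the TLB entry, so the
    # next access is forced through the miss handler.
    page.referenced = False
    page.in_tlb = False

p = Page()
access(p, is_store=False)     # read: referenced, still not dirty
clear_reference(p)            # operating system starts an observation
access(p, is_store=True)      # store: referenced again, now dirty
```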

Access Control 

Access rights checking is based on the access rights and 
access ID fields in the TLB entry used to perform the trans- 
lation. Access rights checking occurs with virtual address 
translation, unless disabled by the P flag in the PSW. There 
is no access control when using physical addressing. 

Fields in the TLB entry for a particular page permit con- 
trol of access to the page in three dimensions: 

■ Which of data read, data write, instruction execute, and 
the privilege level change function of the GATEWAY in- 
struction are permitted (What) 

■ The privilege level at which the process must be execut- 
ing (When) 

■ The process or group of processes allowed to access the 
page (Who). 

These three dimensions are provided by two independent,
simple mechanisms that combine to provide the required
protection, and which can be evaluated in parallel to
provide efficient access control. The combination is
designed to support both conventional and virtual machine
operating systems.

Access Rights. The first two dimensions of access control
are provided using the access rights field of the TLB entry
and the process privilege level. There are four levels (0 to
3), with 0 being the most privileged. Associated with each

HP Precision Architecture Caches and TLBs

An HP Precision processor typically interfaces to the memory
system via the translation lookaside buffer (TLB) and the cache
memory. The architecture is designed to allow simple, high-
speed implementations by making the TLB and cache visible to
software, and by placing constraints on software. The
architecture also explicitly separates instruction and data caches, and
instruction and data TLBs, although this is not a restriction on
hardware implementations.

A cache is a small, high-speed memory that shortens main
memory access times by keeping copies of the most recently
accessed data. The cache is divided into blocks of data, and
each block has an address tag that specifies the corresponding
block of memory. When the processor accesses data, the block
is copied from main memory into the cache. If the processor
modifies the data (by doing stores), the copy in the cache will
be more up-to-date than the copy in memory. The stale data in
the memory at the place specified by the tag is eventually
updated to correspond to the new data in the cache, using either
the copy-back or the write-through update strategies.

Similarly, a TLB speeds up virtual address translations by
acting as a cache for recent translations. When the processor
accesses memory with a virtual address, the TLB checks for an
entry with that virtual page number. If it is present, the
corresponding physical page number is used to generate the physical
address. Otherwise, there is a TLB miss, which must be serviced
before the virtual memory access can be finished.

To allow the implementation of large, high-speed caches, the
architecture disallows address aliasing, the capability of having
two different virtual pages mapped to the same physical page.
While address aliasing is of some use to software, it has severe
impact on cache design. Normally, a portion of the address
called the index is used to specify a block or a small group of
blocks to be examined for a matching tag, instead of examining
all blocks in the cache. Address aliasing precludes using the
virtual page as part of the index. Otherwise, a virtual access
could put data into the cache based on its index, and a later
virtual access, using the other (aliased) address, would not find
it in the cache because the index was different in the virtual page
portion. The second access would then go to main memory,
where it would get an inconsistent or stale copy.

Since HP Precision Architecture prohibits the use of address
aliasing, the cache can use the virtual page portion of the address
as part of the index, without causing the stale data problem
described above. This allows the cache to be accessed in parallel
with the TLB without restricting the size of the cache to that of
the page size multiplied by the set-associativity of the cache
organization.

If an object is to be referenced by both its virtual address and
its corresponding physical address, software must flush the
cache before accessing the data in the other mode. The one
exception is if the physical and virtual addresses are identical,
namely, the virtual address is in space zero and the offset within
the space is the same as the physical address. Since the
addresses are identical, the index chosen by the cache would
be identical, thus avoiding the above stale data problem. This
case is called equivalent mapping.

Uniprocessor Cache Management

HP Precision Architecture makes caches visible to software,
and supports separate instruction and data caches when
desirable for extra bandwidth, or a unified cache for reduced expense.
It will also support very low-cost systems without caches, where



process is a current privilege level.

The access rights information is encoded in seven bits
divided into three fields: type, first privilege level (PL1),
and second privilege level (PL2) fields. The type field
defines the use of the page (data or code) and, for privilege
promotion instructions, the privilege level to which the
process will be promoted. PL1 and PL2 define the privilege
levels required for read, write, or execute access to the
page. The meaning of the type field and the interpretation
of PL1 and PL2 are given in Fig. 6. Read and write fields
specify the least privileged levels allowed to read or write
the page, respectively. Xleast gives the least privileged level
allowed to execute instructions from that page. Xmost gives
the most privileged level allowed to execute instructions
from the page and is used to prevent privileged code from
inadvertently branching onto a page that cannot be trusted.
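
The access rights check can be sketched as follows, using the
interpretation of the type, PL1, and PL2 fields summarized in
Fig. 6. The function name is hypothetical, and the packing of
the seven bits (type in the high-order three bits, then PL1,
then PL2) is an assumption made for this illustration.

```python
def access_allowed(access_rights, privilege, kind):
    """Check one access against a 7-bit access rights value.

    privilege -- current privilege level, 0 (most) to 3 (least)
    kind      -- "read", "write", or "execute"
    """
    page_type = (access_rights >> 4) & 0b111
    pl1 = (access_rights >> 2) & 0b11
    pl2 = access_rights & 0b11
    if kind == "read":       # types 0-3: PL1 is the read level
        return page_type <= 3 and privilege <= pl1
    if kind == "write":      # types 1 and 3: PL2 is the write level
        return page_type in (1, 3) and privilege <= pl2
    if kind == "execute":    # types 2-7: PL1 = Xleast, PL2 = Xmost
        return page_type >= 2 and pl2 <= privilege <= pl1
    raise ValueError(kind)

# Normal data page readable by anyone, writable only at level 0.
data_page = (1 << 4) | (3 << 2) | 0
# Normal code page executable only at level 3 (Xleast = Xmost = 3).
user_code = (2 << 4) | (3 << 2) | 3
```

In this model a level-0 routine cannot execute from the user code
page, which is the role the text assigns to Xmost.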

The privilege level mechanism allows a process to have
different access rights over time without the overhead of
changing TLB entries when access changes or at process
switch. Thus user programs (privilege level 3) can invoke
the services of an operating system supervisor (privilege level
1) or kernel (privilege level 0) using an efficient procedure
call and no interruption or process switch is required.

The entry to a more privileged routine can be
implemented as a procedure call to a GATEWAY instruction
that branches to the body of the routine. If a GATEWAY
instruction is fetched from a proprietary code page, then
when it executes it changes the privilege level to that
specified by the low-order two bits of the type field for
that page (if that level is more privileged than the current
level). The GATEWAY instruction stores the caller's privilege
level in the return address register so that it cannot be
"forged" by the caller.

The architecture defines two trap conditions (higher and 
lower privilege transfer traps) that can be enabled to allow
an operating system to intercept privilege level changes. 
These are provided to support languages that allow multi- 
ple processes to share a single stack with different access 
rights. 

Access ID. A second field in the TLB entry, the 15-bit
access ID, provides the third dimension of access control.
It allows each process sharing memory to access different 
domains in memory without the overhead of changing 
fields in the TLB (and associated data in memory) on pro- 
cess switch. 



Access rights field:  Type (3 bits)  PL1 (2 bits)  PL2 (2 bits)

Type   PL1            PL2            Use
0      Read           -              Read-Only Data Page
1      Read           Write          Normal Data Page
2      Read, Xleast   Xmost          Normal Code Page
3      Read, Xleast   Write, Xmost   Dynamic Code Page
4-7    Xleast         Xmost          Proprietary Code Page

Fig. 6. Interpretation of access rights fields.

the cache control instructions are treated as NOPs.

FLUSH DATA CACHE and FLUSH INSTRUCTION CACHE instructions
remove a cache block and update memory if necessary. PURGE
DATA CACHE removes a cache block without update. The latter
is used only when the data can be destroyed, for example when
the page is removed at the conclusion of a program.

The architecture puts the responsibility of uniprocessor cache
consistency on software based on the assumption that the
software knows when special action is needed to ensure consistency.
Software must take special action when it is changing a page's
virtual address, when it is modifying the instruction stream
(self-modifying code), and when it is performing I/O.

When the operating system changes a page's virtual address,
it must flush the range of addresses for that page, to ensure that
there are no blocks in the cache using the old virtual address.

If software stores into the instruction stream, the modification
would occur in the data cache, while instructions are fetched
out of the instruction cache. Rather than have the cache somehow
figure out that software is doing this, software is required to flush
the data from both the data cache (to update main memory) and
the instruction cache (to force the next fetch to go to memory)
after modification. Since self-modifying code is so infrequent,
the extra time required is negligible.

From the standpoint of the cache, I/O is like another processor
reading or modifying memory. If the I/O system is reading data
from memory that is currently in the cache, it is reading a stale,
out-of-date copy. Other architectures have solved this problem
either by having I/O go through the cache, or by having all I/O
transactions interrogate the cache to see whether it has a more
up-to-date copy. This either uses up available cache bandwidth,
depriving the processor, or lengthens the cache cycle time,
slowing down the entire computer. HP Precision Architecture requires
software to flush the address range involved in the I/O transfer
before it occurs, so that the cache does not need to do any
checking. The overhead of flushing for I/O is a very small amount
and less than the impact on performance incurred by the other
schemes.

HP Precision instructions include a nondivisible load and store
zero instruction, LOAD AND CLEAR WORD, which is similar to the
test and set operation in other architectures. This instruction
reads a word from main memory, flushing the cache first if it is
present, then clears the word in memory, in one indivisible
operation. It is used to implement semaphores to synchronize access
to data structures that are shared between the processors and
the I/O modules, or for data structures that can be modified by
two or more processes operating asynchronously.

Multiprocessor Cache Management

For HP Precision uniprocessors, software is responsible for
cache consistency. For multiprocessors, however, hardware is
responsible for cache consistency since the model presented
to software is one in which all the processors share a single
instruction cache, a single data cache, a single instruction TLB,
and a single data TLB. This is because it may be difficult for
software to recognize all data consistency situations in a
multiprocessor and handle these situations efficiently for both
uniprocessor and multiprocessor systems. Software is still
responsible for maintaining consistency for I/O, for instruction
modification, and for virtual address mapping.

In an actual multiprocessor system, each processor may have
its own cache and TLB. To maintain the model of a single shared
cache and TLB among processors, standard cache consistency
methods are used. In addition, the explicit cache and TLB flush
and purge instructions are broadcast to all processors, so that
a flush instruction executed by one processor will affect all
processor caches or TLBs in the system. The broadcast flushes and
purges still do not affect I/O modules, allowing them to remain
simple.



An access ID of zero defines a page with public access
allowed, subject only to access rights checking. A nonzero
access ID permits access to the corresponding page only 
when one of the four protection IDs in control registers 
matches the access ID. 

The four protection IDs designate up to four groups of
pages that are accessible to the currently executing process.
Four are provided to facilitate the controlled transfer of
information between logical environments. The low-order
bit of each of the four protection IDs is the write disable
(WD) bit. When the WD bit is set to 1, writing is disallowed
for all privilege levels to the pages so protected. For
example, the WD bit allows a single writer and multiple readers
for a group of processes.
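
The access ID check can be sketched as below. The function name
is hypothetical, and the register layout (the identifier in the
bits above the low-order WD bit) is an assumption consistent
with the description above; access rights checking is a separate
step not shown here.

```python
def protection_check(access_id, protection_ids, is_write):
    """Return True if the access ID check permits the access.

    protection_ids -- the four control register values, each
                      modeled as (identifier << 1) | WD
    """
    if access_id == 0:
        return True                      # public page
    for pid in protection_ids:
        if (pid >> 1) == access_id:
            write_disabled = pid & 1
            if not (is_write and write_disabled):
                return True
    return False

# A reader process: group 7 is read-only for it, group 9 is writable.
pids = [(7 << 1) | 1, (9 << 1) | 0, 0, 0]
print(protection_check(7, pids, is_write=False))   # True
print(protection_check(7, pids, is_write=True))    # False
```

A writer process for group 7 would load the same identifier with
WD clear, giving the single-writer, multiple-reader arrangement
mentioned in the text.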

Privileged software needs a mechanism by which it can 
avoid performing, on behalf of a less privileged caller,
actions not permitted the caller. This is provided by the
PROBE instructions, which test the caller's ability to read
or write a particular page of memory. 

Functional Operations 

The data transformation instructions provide all of the
common arithmetic and logical functions. There are also
several uncommon functions that provide building block
instructions for complex operations and functions for
efficient high-level language optimizations. The transformation
instructions form a powerful resource for compilers
to generate efficient code while defining an easily
implemented hardware execution engine.

Each transformation instruction also specifies the
conditional occurrence of either a skip or a trap, based on its
opcode and the condition field. An immediate source can
also be specified. The arithmetic/logical instructions are
not completely orthogonal. Only those operations and
options considered useful were defined.

Arithmetic Operations 

Addition and subtraction instructions offer the widest 
flexibility in operand specification, condition formation, 
and testing. The two operands can come from two general 
registers, or from one general register and an 11-bit signed 
immediate. The SUBTRACT IMMEDIATE instruction is a
reverse subtraction to allow subtraction of a variable from
an immediate. Subtraction of an immediate from a variable
is performed with an ADD IMMEDIATE instruction. The carry
or borrow bit can be included in the addition or subtraction.

Software will be able to construct any often needed
function in a single instruction. Since a conditional trap or an
overflow trap can optionally be specified, many range
violations and overflow checks required by high-level
languages can be performed without extra instructions. For
some checks an additional instruction might be needed,
but generally the architecture provides for the optimization
of the high-frequency execution path.

Studies of large collections of programs show that integer 
multiply and divide operations are infrequently used.
Furthermore, when multiply is used, one of the operands is
usually a constant known at compile time. Hence, instead 
of implementing a general multiply or divide instruction, 
HP Precision Architecture implements multiply and divide 
primitives, which do not require additional execution 
hardware. 

The SHIFT AND ADD instructions are used as building block
multiply instructions. They specify a one, two, or three-bit
shift of one of the source registers before adding it to the
other source. By combining a short sequence of these
instructions, multiplication by a constant can be done quickly.
The SHIFT AND ADD instructions are performed by the basic 
execution engine, and share the use of the preshifter multi- 
plexers in the ALU data path with the address calculation 
for the load and store instructions. These easily im- 
plemented multiply primitives are used effectively by the 
software for a variety of constructs. 

Multiplication by the constants 3, 5, 9, and any power
of 2 can be done in one instruction. Multiplication by other
small constants can be performed in two or three
instructions. When it is necessary to perform multiplication by a
variable, a specialized subroutine breaks the multiplier into
four-bit pieces and forms the complete product in an aver- 
age of twenty instructions.^ 
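
Multiplication by small constants using the shift-and-add
primitive can be sketched as follows. The helper below is a
hypothetical software model of the operation described above
(shift one source left by one, two, or three bits, then add the
other source), with results wrapped to 32 bits.

```python
MASK32 = 0xFFFFFFFF

def shift_and_add(shift, a, b):
    """Model of a SHIFT AND ADD step: (a << shift) + b, shift in 1..3."""
    assert shift in (1, 2, 3)
    return ((a << shift) + b) & MASK32

def times3(x):
    return shift_and_add(1, x, x)     # 2x + x

def times5(x):
    return shift_and_add(2, x, x)     # 4x + x

def times9(x):
    return shift_and_add(3, x, x)     # 8x + x

def times10(x):
    t = shift_and_add(2, x, x)        # t = 5x
    return shift_and_add(1, t, 0)     # 2t + 0 = 10x (two steps)

print(times3(7), times5(7), times9(7), times10(7))   # 21 35 63 70
```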

Division by small constants is handled as special cases 
by the compilers, while for general cases, the DIVIDE STEP 
instruction implements a single-bit nonrestoring division 
operation. A specialized subroutine uses thirty-two of these 
instructions, in combination with SHIFT DOUBLE instruc- 
tions, to produce the quotient and remainder. 
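
The idea behind single-bit nonrestoring division can be shown
with a short model. This is a sketch of the general technique
for unsigned operands, not a description of the exact register
behavior of the DIVIDE STEP instruction; the function name is
hypothetical.

```python
def divide_nonrestoring(dividend, divisor, bits=32):
    """Unsigned nonrestoring division, one quotient bit per step."""
    assert divisor > 0 and 0 <= dividend < (1 << bits)
    remainder = 0
    quotient = 0
    for i in range(bits - 1, -1, -1):
        next_bit = (dividend >> i) & 1
        # Subtract the divisor if the partial remainder is nonnegative,
        # otherwise add it back in the same step (no separate restore).
        if remainder >= 0:
            remainder = 2 * remainder + next_bit - divisor
        else:
            remainder = 2 * remainder + next_bit + divisor
        quotient = (quotient << 1) | (1 if remainder >= 0 else 0)
    if remainder < 0:
        remainder += divisor          # final correction
    return quotient, remainder

print(divide_nonrestoring(100, 7))    # (14, 2)
```

Each loop iteration corresponds to one divide step, which is why
a 32-bit quotient takes thirty-two steps.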

The added hardware cost and potential increase in basic 
machine cycle time, coupled with infrequent use, ruled 
out the inclusion of division and multiplication in the basic 
instruction set. The architected assist instruction exten- 
sions include integer multiply and divide functions for 
applications requiring higher frequencies of multiplication 
and division. 

Logical and Field Operations 

Logical operations are fundamental instructions for data 
manipulation. OR, XOR, AND, and AND COMPLEMENT instruc- 
tions provide a full range of logical operations. The AND 
COMPLEMENT instruction ANDs a register with the comple- 
ment of a second register. This operation reduces the num- 
ber of masks required for carrying out bit manipulation. 

Boolean values are easily generated using the COMPARE 
AND CLEAR instructions. This instruction first assumes a 
Boolean value of false by always storing a zero in the target 
register, and specifies the negation of the desired Boolean 
condition for the conditional nullification of the following 
instruction. The following instruction, if not nullified, will
set the target register to true. Other architectures often re- 
quire branch instructions to implement an equivalent func- 
tion. 

The field manipulation instructions, like EXTRACT, DE- 
POSIT, SHIFT DOUBLE, and BRANCH ON BIT, are implemented 
by the shift-merge unit of the basic execution engine (Fig.
3).



An EXTRACT instruction takes a field from any portion 
of a word and creates a result with the field right-justified. 
The remainder of the target register is filled with zeros or 
sign-extended, supporting both logical and arithmetic right 
shifts as special cases. 

A DEPOSIT instruction takes a right-justified field and 
puts it into any portion of the target word, thus merging 
the selected field with data in the rest of the word. DEPOSIT 
IMMEDIATE deposits a sign-extended five-bit immediate into
the target register, which is perfect for setting or clearing 
a small number of bits in a register. ZERO AND DEPOSIT 
clears the remainder of the target, which is useful when 
the original target information is not wanted. DEPOSIT in- 
structions can easily implement left shift operations and 
multiplications by a power of two. 

Fig. 7 illustrates the movement of an arbitrary field. A, 
from general register x to another arbitrary field position 
in general register y, using a pair of extract and deposit 
instructions. General register z is used as a temporary regis- 
ter for this operation. 

SHIFT DOUBLE instructions concatenate two registers,
shift them 0 to 31 bits, and store the 32 rightmost bits into
the target. If one of the source registers is general register
zero, a left shift or right shift is performed. If both source
registers are the same, a rotate operation is performed. SHIFT
DOUBLE instructions are useful for unaligned byte moves
or bit-block transfers, and for extracting data fields
spanning word boundaries from packed records.
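
The special cases of SHIFT DOUBLE can be illustrated with a
small model. The function name is hypothetical, and the sketch
assumes the concatenated 64-bit pair is shifted to the right,
which is consistent with keeping the 32 rightmost bits.

```python
MASK32 = 0xFFFFFFFF

def shift_double(high, low, amount):
    """Concatenate high:low, shift right by amount, keep the low 32 bits."""
    assert 0 <= amount <= 31
    pair = ((high & MASK32) << 32) | (low & MASK32)
    return (pair >> amount) & MASK32

x = 0x12345678
right = shift_double(0, x, 8)         # logical right shift by 8
left = shift_double(x, 0, 32 - 8)     # left shift by 8
rotated = shift_double(x, x, 8)       # rotate right by 8
print(hex(right), hex(left), hex(rotated))
```

Supplying zero as the high or low half is what the text means by
using general register zero as one of the sources.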

The fields for these operations are specified by position 
and length. The length is always an immediate in the in- 
struction, but the position may be either an immediate or 
the contents of a control register called the shift amount 
register. This allows dynamically generated shift amounts. 
Unlike other architectures that specify the field position 
by encoding the leftmost bit in the field, HP Precision Ar- 
chitecture specifies the rightmost bit position. This was 
done to simplify the control logic for the shifter by making 
the number of bits of right shift depend only on field posi- 
tion, not on both position and length. 
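
The effect of specifying a field by its rightmost bit can be seen
in a short sketch of extract and deposit. The function names are
hypothetical, and the sketch assumes bits are numbered from 0 at
the leftmost (most significant) bit to 31 at the rightmost.

```python
MASK32 = 0xFFFFFFFF

def extract(word, pos, length, signed=False):
    """Right-justify the field whose rightmost bit is at position pos."""
    shift = 31 - pos                  # depends on position only
    field = (word >> shift) & ((1 << length) - 1)
    if signed and (field >> (length - 1)) & 1:
        field -= 1 << length          # sign-extend into the upper bits
    return field & MASK32

def deposit(target, field, pos, length):
    """Merge a right-justified field into target, ending at position pos."""
    shift = 31 - pos
    mask = ((1 << length) - 1) << shift
    return (target & ~mask & MASK32) | ((field << shift) & mask)

# Move a 12-bit field from one register to another via a temporary.
gr_x = 0x00ABC000
gr_z = extract(gr_x, 19, 12)                  # field ends at bit 19
gr_y = deposit(0xFFFFFFFF, gr_z, 31, 12)      # place it at the right edge
print(hex(gr_z), hex(gr_y))
```

Note that the right shift count in `extract` is computed from the
position alone; the length only affects the mask.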

Unit Operations 

HP Precision Architecture includes a set of five
instructions designed to support the parallel processing of small
units (digits, bytes, and half words) within a word. These
instructions make use of the seven low-order PSW carry/
borrow bits. They are included in the architecture primarily
to facilitate string search (byte and half word units) and
decimal arithmetic (digit units). The half word units
support the processing of 16-bit international character sets.

The UNIT XOR and UNIT ADD COMPLEMENT instructions
can be used to compare corresponding subunits of two
words for equality or a less-than relationship. These
operations are particularly useful for scanning for byte or
halfword values a full word at a time.

Fig. 7. Movement of an arbitrary field using extract and
deposit instructions. (1) Extract A from GR x into GR z.
(2) Deposit A from GR z into GR y.

Packed decimal numbers represent each decimal digit
in a 4-bit field. When these numbers are to be added
together, 6 must be added to each digit of one operand so
that carries will propagate properly during binary addition.
After addition, each result digit must have 6 subtracted
from it unless the addition for that digit generated a carry.
Also, when a repeated sequence of additions is to be
performed, the bias must be restored to the result by adding
6 to each digit from which a carry was generated. These
correction steps are performed by the DECIMAL CORRECT
and INTERMEDIATE DECIMAL CORRECT instructions,
respectively.

Assuming that the bias value and operands are in general
registers, BCD additions and subtractions require three
instructions to retire each 8-digit word.
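
The bias-add-correct sequence for packed decimal addition can
be modeled in software as below. This is a sketch of the
arithmetic only: the function name is hypothetical, the per-digit
carries are recovered with ordinary integer operations rather
than PSW carry bits, and both operands are assumed to be valid
8-digit packed decimal words.

```python
MASK32 = 0xFFFFFFFF
BIAS = 0x66666666            # 6 in each of the eight digit positions

def bcd_add(a, b):
    """Add two 8-digit packed decimal words; return (sum, carry_out)."""
    biased = a + BIAS                         # step 1: bias one operand
    total = biased + b                        # step 2: binary addition
    # Bit 4*i of carries is set if digit i generated a carry.
    carries = (biased ^ b ^ total) >> 4
    result = total & MASK32
    for digit in range(8):                    # step 3: decimal correct
        if not (carries >> (4 * digit)) & 1:
            result -= 6 << (4 * digit)        # remove the unused bias
    return result, (total >> 32) & 1

print(hex(bcd_add(0x1234, 0x5678)[0]))        # 0x6912
```

Digits that generated a carry have already wrapped past 16 and so
hold the correct decimal digit; only the others still carry the
bias of 6.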

Instruction Formats and Encoding 

In HP Precision Architecture, all instructions have a fixed
length of thirty-two bits, which is one word of memory.
Time-critical functions are placed in fixed-position fields,
so that they can proceed with minimal or no decoding.
Since all instructions are word-aligned, an instruction
never crosses a page boundary.

The addresses of the two general register source operands
for the execution engine are placed in fixed-position fields
(bits <6:10> and bits <11:15>), so that registers can be
read before or during the decode phase of the instruction.
If an immediate operand is required rather than a general
register operand, the selection is done by a multiplexer in
front of the appropriate port of the ALU or shift-merge unit.

In instructions with three register specifiers, the third
register specifier is placed in the last five bits of the
instruction, bits <27:31>. However, any registers to be used as
source operands must be specified in the first two register
specifier fields. A register used as the target register for a
data transformation or data movement operation can be
specified in any of the three register specifier fields.
Decoding the address of a target register is not time critical, since
the writing of a result occurs later than the reading of
operands.

The space register specifier field is also placed in a
fixed-position field, since it is also used to supply an operand for
virtual memory addressing.

The major operation code field (opcode) is placed in a 
6-bit fixed-position field. The operations are divided into 
subclasses, each subclass occupying one point in the code 
space of the major opcode. Each operation in a subclass 
occupies one point in its suboperation (subop) code space. 
The size of the subop field depends on the particular sub- 
class of operations. The placement of the subop field is 
done to minimize the impact on the fixed fields of more
time-critical operations. The encoding of the subop field
is done to minimize decoding within a subclass. Often,
bits in the subop field can be wired directly to control
points in the particular portion of the processor implement-
ing this subclass of instructions.

In the case of a subclass of operations with a relatively
long immediate field in the instruction format, a subop
field would take away bits from the long immediate field.
So, each of these long-immediate instructions is assigned
a point in the major opcode space. Examples are the load
and store instructions with long displacements and the
ALU instructions with long immediates.
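
The fixed-position fields described above can be illustrated with a short sketch. It assumes that bit 0 is the leftmost bit of the 32-bit instruction word (consistent with the "last five bits" being bits 27 to 31) and that the 6-bit major opcode occupies bits 0 to 5; the opcode position is an assumption, since the text does not state it here. The names field, decode_fixed_fields, r_a, r_b, and r_c are hypothetical.

```python
def field(word, first, last):
    """Extract bits first..last of a 32-bit word.

    Assumes bit 0 is the leftmost (most significant) bit and
    bit 31 is the rightmost bit.
    """
    width = last - first + 1
    return (word >> (31 - last)) & ((1 << width) - 1)


def decode_fixed_fields(word):
    """Pull out the fields that sit in the same place in every format."""
    return {
        "opcode": field(word, 0, 5),   # assumed position of the major opcode
        "r_a": field(word, 6, 10),     # first source register specifier
        "r_b": field(word, 11, 15),    # second source register specifier
        "r_c": field(word, 27, 31),    # third register specifier
    }


# Build a word with opcode 2, registers 3, 4, and 5, then decode it.
example = (2 << 26) | (3 << 21) | (4 << 16) | 5
fields = decode_fixed_fields(example)
```

Because every field is a fixed shift and mask, the register numbers can be extracted in parallel with opcode decoding, which is the property the text emphasizes.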

Immediates embedded in an instruction are sometimes
broken up into different fields so as not to impact the place-
ment of fixed fields, and to minimize the multiplexing
required for assembling immediates of different lengths.

Although immediates come in various sizes, their sign
bit is always in a fixed position: the rightmost bit position 
of the immediate. This aspect of the instruction encoding 
enables immediate sign extension to proceed without 
lengthy decoding and selection from various bit positions, 
which would happen if the sign bit were placed in the 
customary leftmost position of the variable-length im-
mediate fields.
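
As a hedged illustration of this encoding, the sketch below sign-extends an immediate whose sign bit is the rightmost bit of the field. It assumes that the remaining bits, read left to right, form the low-order magnitude bits; the text states only the position of the sign bit, so that arrangement and the name low_sign_extend are assumptions.

```python
def low_sign_extend(field_value, length):
    """Sign-extend an immediate whose sign bit is its rightmost bit.

    field_value: the raw immediate field as an unsigned integer.
    length: the width of the field in bits.
    Assumes the bits to the left of the sign bit hold the low-order
    bits of the two's complement value.
    """
    sign = field_value & 1
    low_bits = field_value >> 1
    return low_bits - (sign << (length - 1))


# Five-bit examples: the sign is always tested at the same bit position,
# whatever the field length.
plus_one = low_sign_extend(0b00010, 5)
minus_one = low_sign_extend(0b11111, 5)
```

With the sign in a fixed position, hardware can begin sign extension before it knows which immediate length the instruction uses.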

Formats 

Fig. 8 shows the instruction formats used to encode all
HP Precision Architecture instructions. The first three for- 
mats are for load and store instructions, followed by the 
instruction formats for long immediate instructions, branch 
instructions, three types of ALU instructions, system man- 
agement instructions, the DIAGNOSE instruction, special 
function unit instructions, and coprocessor instructions. 

The first format, for the long-displacement load and store 
instructions, essentially determined the positions of most 
of the major fixed-position fields like the opcode, the two
source register specifier fields, and the space register 
specifier field. It also determined the right alignment of an 
immediate field, with the sign bit occupying the rightmost 
instruction bit. The ALU 3R format, for the basic three-regis- 
ter data transformation operations, determined the posi- 
tions of other fixed-position fields like the third register 
specifier field, the condition field, and the falsify (condi-
tion negation) field.

The last three formats show the instruction extension 
capabilities in the architecture. One major opcode is re-
served for the DIAGNOSE instruction, which can be used to
define implementation dependent instructions. Only the 
major opcode of this instruction is defined. The next two 
are assist instruction formats, for the special function unit 
and coprocessor types of assists, respectively. For example, 
the floating-point coprocessor uses coprocessor unit iden-
tifier "zero" and encodes all its operations in the u fields.
While DIAGNOSE instructions are not portable between im- 
plementations, the assist instructions are fully portable, 
with transparent software emulation of these instructions 
in the absence of hardware support. 

Conclusion 

HP Precision Architecture is frequently referred to as a
reduced instruction set computer (RISC) architecture. In-
deed, the execution model of the architecture is RISC-
based, since it exhibits the features of single-cycle execu-
tion and register-based execution, where load and store
instructions are the only instructions for accessing the 
memory system. The architecture also uses the RISC con- 
cept of cooperation between software and hardware to 
achieve simpler implementations with better overall per-
formance.

HP Precision Architecture, however, goes beyond RISC 
in many ways, even in its execution model. For example,
RISC machines emphasize reducing the number of instruc- 
tions in the instruction set to simplify the implementation 
and improve execution time. Only the most frequently 
used, basic operations are encoded into instructions. How-
ever, frequency alone is not sufficient, since some instruc-
tions may occur frequently because of inefficient code gen-
eration, arbitrary software conventions, or an inefficient
architecture. 

In designing the next-generation architecture for Hew-
lett-Packard computers, the intrinsic functions needed in
different computing environments like data base, computa- 
tion intensive, real-time, network, program development, 
and artificial intelligence environments were determined. 
These intrinsic functions are supported efficiently in the 
architecture. Minimizing the actual number of instructions 
is not as important as choosing instructions that can be 
executed in a single cycle with relatively simple hardware. 
Complex but necessary operations that take more than
one cycle to execute are broken down into more primitive
operations, each operation to be executed in one instruc-
tion. If it is not practical to break these complex operations
into more primitive operations, they are defined as assist
instructions, by means of the architecture's instruction ex- 
tension capabilities. If more than one useful operation can 
be executed in one cycle, HP Precision Architecture defines 
combined operations in a single instruction, resulting in a 
more efficient use of the execution resources and in im- 
proved code compaction. 

HP Precision Architecture's execution model has other
noteworthy features like its heavy use of maximal-length
immediates as operands for the execution engine, and its
efficient address modification mechanisms for the rapid 
access of data structures. The architecture also includes 
some uncommon functions for efficiently supporting the 
movement and manipulation of unaligned strings of bytes 
or bits, and primitives for the optimization of high-level 
language programs. 

HP Precision Architecture has gone beyond RISC in its 
control flow model with its conditional branch optimiza- 
tion features, its ring-crossing branch instructions, its nul- 
lification features, its conditional trap feature, its debug- 
ging support, and its efficient interruption mechanisms. 

The architecture's virtual memory addressing and protec- 
tion mechanisms support a wide range of system needs, from 
the smallest controller to the largest multinetwork environ-
ment. Indeed, the HP Precision program was internally code-
named Spectrum, since its objective was to serve the full
spectrum of HP customers' information processing needs. 



20 HEWLETT-PACKARD JOURNAL AUGUST 1986



In summary, HP Precision Architecture represents an
evolution of the more successful ideas in past computer 
architectures, combined with support for the anticipated 
needs of future computer systems. 
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HP Precision Architecture: The 
Input/Output System 

A simple, uniform architecture satisfies the I/O needs of
large and small systems, and provides flexibility for future 
enhancements. 

by David V. James, Stephen G. Burger, and Robert D. Odineal



THE HP PRECISION I/O SYSTEM was defined to pro-
vide a flexible framework to leverage existing I/O card
designs without restricting the capabilities of low- 
cost or high-performance I/O cards in the future. The HP 
Precision Architecture development program provided an
opportunity to incorporate and achieve global objectives of
scalability, leverageability, and flexibility in a corporate I/O
strategy. These objectives have been met by basing the I/O
system on the design strategies of simplicity and uniformity. 

Scalability is provided by a unified family of compatible 
buses. A basic single-bus configuration can be extended to 
include higher-performance or lower-performance buses,
or expanded to include additional buses of the same perfor- 
mance. 

Leverageability requires interchangeable parts. Hard-
ware interchangeability is achieved by using one physical
component in systems having similar requirements for 
function and performance. Software interchangeability is 
achieved by using one version of I/O driver software for 
functionally equivalent hardware components that differ 
only in performance and capacity. 

Flexibility is more than the use of leveraged components. 
A system is flexible when it is implemented to meet existing 
needs and is alterable to match the changing needs of the
future with minimal perturbation of a customer's existing
system. A flexible I/O system allows the existing I/O card
designs to be leveraged for the initial product shipment,
while also allowing the I/O system to be upgraded to sup-
port more demanding I/O requirements in the future (e.g.,
multiprocessors, shared peripherals, and memory mapped
graphics). Flexibility is also provided by minimizing con-
figuration restrictions in the I/O system.

Levels of Design 

The definition process for the I/O system included rigor-
ous documentation at all levels of the design. These design
levels included the I/O architecture, the connect protocol,
and multiple definitions of bus standards. The I/O architec-
ture defines the types of modules that connect to an HP
Precision bus (including processors, memory, and I/O) and 
defines the memory mapped registers used by other mod- 
ules to control or observe the module's activity. This ar- 
chitectural interface is defined in sufficient detail to allow 
the hardware and software to be developed independently. 
HP Precision I/O Architecture includes the definition of 
simple instructions fetched from memory and executed by 
I/O modules with direct memory access (DMA) capabilities,
but does not include the definition of instructions executed
by the more general-purpose processor module. 

The connect protocol defines the standard set of bus 
transactions used to communicate between modules de- 
fined by HP Precision I/O Architecture. This includes the 
definition of transaction functionality, transfer sizes, align-
ment restrictions, and returned status information. In addi-
tion to implementing the connect protocol, each HP Preci-
sion system bus definition includes the timing of signal
transitions, voltage thresholds of transceivers, power re-
quirements, and other physical parameters.

The HP Precision program provided a unique opportu- 
nity to upgrade all levels of the I/O system definition simul- 
taneously. The method used to develop the system was 
top-down definition coupled to bottom-up verification.

The steps in top-down definition are architecture, pro-
tocol, standards, and design. The I/O architecture is defined
around a model established to meet the objectives. The 
architectural concepts define the required connect pro-
tocol. The bus standards are defined based on that connect
protocol, and the bus standards are used in the design of 
I/O cards. Fig. 1 illustrates the process. 

The simultaneous activities in the architecture and design
phases of the definition were coordinated to provide con-
stant feedback between the intermediate levels. The initial
designs revealed flaws or incompletely specified portions 
of the bus standards. These were corrected in the bus stan-
dards and the corrections were propagated up to the appro-
priate higher level. Feedback also occurred between the
bus standards and the connect protocol, and between the
connect protocol and the I/O architecture. This controlled
feedback process provided the design evaluations required
to update the initial drafts of the I/O architecture, connect
protocol, and bus standards documents. These documents
are the basis for the design of the system components, or 
modules.

Fig. 1. Feedback paths in the definition process for HP Preci-
sion I/O Architecture.

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Bus Options 

HP Precision I/O Architecture is based on functional en-
tities called modules. The minimal system consists of a
processor, memory, and I/O modules attached to a single 
system bus, as shown in Fig. 2. The single-bus configuration 
is sufficient to support low-range and midrange products. 

For high-end products, multiple buses are required, as
shown in Fig. 3. The processor and memory are connected 
to a higher-performance HP Precision bus and the I/O mod-
ules are connected to other low-cost buses. In this example,
I/O connections to a "foreign" bus and a "native" system
bus are illustrated. The native system bus, or simply system
bus, implements the HP Precision connect protocol; the
foreign bus does not. The native and foreign buses are
connected through a bus adapter module, and special soft-
ware is required to support the connection. The bus adapter
architecture allows I/O cards developed for other buses to
be leveraged in the HP Precision system products, as dis-
cussed later.

Two native system buses can also be connected through 
a bus converter module. This connection is transparent to 
normal software operation. 

Based on the destination address of a transaction, the 
bus converter forwards the transaction to remote modules 
attached to a physically separate bus. The bus converter is 
not involved in local transactions between modules at-
tached to the same bus. Unless bus errors occur, the for-
warding of a remote transaction is transparent to the mod-
ule that originates the transaction. This allows the I/O
driver software developed for a local module to be leveraged 
when the module is moved to a remote bus. Software 
changes are limited to the optional recovery of errors de- 
tected on the remote bus (the bus converter logs and isolates 
system bus errors).

The bus converter is implemented as a module pair; one 
module is attached to each of the two system buses. The
module pair can be physically separated and connected 
with a high-speed link (e.g., fiber optics), as shown in Fig.
3. This separation is required when the buses cannot be
physically adjacent because of mechanical packaging con-
straints or customer requirements to support remotely lo-
cated peripherals. This would be the case for large I/O
configurations, processor clusters, or remotely located 
graphics and data collection peripherals. 

Module Addressing 

When a system bus is initialized, each module initially 
responds to a 4K-byte "hard" physical address range. The
module's 4K-byte address space is divided into 1024 32-bit
I/O registers. Access to these I/O registers is provided by
the read or write transactions defined by the connect pro-
tocol. For example, write transactions are used to reset the
system or a card, interrupt the processor, and initiate I/O
operations. The more common I/O registers, such as those
used for module identification and initialization, are stan-
dardized to support autoconfiguration and simplify operat-
ing system software.

A 256K-byte address space, aligned to begin at a multiple
of 256K bytes, is provided for each system bus; this is
sufficient to support 64 modules. The physical properties
of card connectors, backplanes, and transceivers normally
limit the number of card slots on a bus to 16. Thus, to
provide a complete set of 64 modules on a system bus,
hardware designers would be required to implement four
modules on each card. For example, a multifunction card
might consist of a processor, memory, and two I/O modules.
In general, not all cards have four modules and the bus
address space is only partially used.
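
The sizes quoted above imply a simple decomposition of a hard physical address, sketched below. The sketch assumes that modules occupy consecutive 4K-byte slots within the bus's 256K-byte space and that registers are numbered by word offset; the text gives the sizes but not the numbering, and the function name decode_hard_address is hypothetical.

```python
BUS_SPACE = 256 * 1024     # bytes of hard address space per system bus
MODULE_SPACE = 4 * 1024    # bytes of hard address space per module
REGISTER_SIZE = 4          # bytes per 32-bit I/O register

MODULES_PER_BUS = BUS_SPACE // MODULE_SPACE            # 64
REGISTERS_PER_MODULE = MODULE_SPACE // REGISTER_SIZE   # 1024


def decode_hard_address(address):
    """Split a hard physical address into (bus base, module, register).

    Assumes modules are laid out as consecutive 4K-byte slots inside a
    bus space that starts on a 256K-byte boundary.
    """
    offset = address % BUS_SPACE
    bus_base = address - offset
    module = offset // MODULE_SPACE
    register = (offset % MODULE_SPACE) // REGISTER_SIZE
    return bus_base, module, register


# Byte 8 of the sixth module slot on a bus whose space begins at 0xFFFC0000.
decoded = decode_hard_address(0xFFFC0000 + 5 * MODULE_SPACE + 8)
```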

The initial address space allocated to memory and I/O 
modules is not generally sufficient to support normal mod- 
ule operation. For these modules, one of the registers in 
the initial address space is used for dynamically assigning 
an extended address space, as shown in Fig. 4. The ex-
tended address space is always a power of two in size, and is
aligned to a physical address that is a multiple of its size. 
To simplify configuration firmware and software, the ex- 
tended address space can be assigned independently of the 
module's initial hard address space. 
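
The size and alignment rule for an extended address space can be checked with a few lines of code. This is a minimal sketch of the rule as stated in the text, not of the configuration firmware; the function name is hypothetical.

```python
def is_valid_extended_space(base, size):
    """Return True if size is a power of two and base is a multiple of size."""
    size_is_power_of_two = size > 0 and (size & (size - 1)) == 0
    return size_is_power_of_two and base % size == 0


# A 1M-byte space at a 1M-byte boundary satisfies the rule.
aligned = is_valid_extended_space(0x00100000, 0x00100000)
# The same size at a 512K-byte offset does not.
misaligned = is_valid_extended_space(0x00180000, 0x00100000)
```

One consequence of this rule is that a module can recognize its extended space by comparing only the high-order address bits, with no adder or range comparison.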

The initial 4K-byte address space of an I/O module maps 
to the supervisor element. Additional register sets, or I/O 
elements, are required to communicate directly with the 
attached devices. These I/O element registers are typically 
located in an extended address space, which is dynamically 
assigned by writing to a supervisor element I/O register.

To simplify the I/O driver software, a single I/O element 
(register set) is allocated for each device to be controlled 
by the software. Multiple devices are supported through 
multiple I/O elements. The architecture provides the design 
freedom needed to achieve a good match between physical 
hardware implementation and logical software interfacing.
For example, a disc controller implemented as a single 
physical device can interface to software through the ad-
dress space of a single I/O element. A full-duplex terminal
controller can be assigned two I/O elements, one for data 
input and one for data output. The software can thus service 
the inbound and outbound data streams independently. A 
terminal multiplexer with eight full-duplex ports can be
implemented as 16 I/O elements, allowing software to per-
form independent I/O operations on each data stream.

Fig. 4. When the initial address space allocated to a memory
module or an I/O module is too small, one of the registers in
the initial address space can be used for dynamic assignment
of an extended address space. A special ROM, IODC (I/O
dependent code), supports autoconfiguration.

During normal operation, an I/O device is controlled by
accessing I/O element registers directly. When DMA or 
similar hardware on the I/O module is shared by multiple 
devices, the use of this shared resource is scheduled by 
the I/O module hardware, not the I/O driver software. This 
simplifies software and generally provides a more efficient 
mechanism for scheduling shared hardware resources. The
I/O registers in the supervisor element are only used for 
module identification, initialization, and error recovery. 

Two sizes of I/O elements are defined, 128 bytes and 4K 
bytes. The packed version (128 bytes) allows up to 16 I/O
elements to be packed into a single 2K-byte page. In the
unpacked 4K-byte version, two pages are provided for the
support of privileged and unprivileged I/O registers. Un-
privileged registers are accessible through both pages;
privileged registers are accessible only through the lower-
addressed page. The higher-addressed page can be mapped
directly into the user's virtual address space without com-
promising system security. This allows many of the I/O
element registers to be accessed directly, without the over-
head of calling operating system software.
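
The two I/O element sizes and the page-based privilege rule can be summarized in a short sketch. It models only the rule stated above (unprivileged registers visible through both pages, privileged registers through the lower-addressed page only); the names and the boolean interface are hypothetical.

```python
PAGE_SIZE = 2 * 1024          # bytes per page, as implied by the text
PACKED_ELEMENT = 128          # bytes per packed I/O element
UNPACKED_ELEMENT = 4 * 1024   # bytes per unpacked I/O element

PACKED_PER_PAGE = PAGE_SIZE // PACKED_ELEMENT        # 16
PAGES_PER_UNPACKED = UNPACKED_ELEMENT // PAGE_SIZE   # 2


def register_visible(register_is_privileged, through_upper_page):
    """Apply the access rule for an unpacked I/O element.

    Privileged registers are hidden when the access goes through the
    higher-addressed page, which may be mapped into user space.
    """
    return not (register_is_privileged and through_upper_page)


# A user process mapped to the upper page cannot reach privileged registers.
user_sees_privileged = register_visible(True, True)
```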

On a memory module, the extended address space maps 
to the module's RAM. Because the extended address space
is automatically assigned, hardware switches are not re- 
quired to configure memory addresses. This improves the 
reliability of the card, and eliminates service calls caused 
by improperly selected switch settings. After initial config-
uration, the supervisor element registers are read periodi-
cally to update the system's memory error log.

I/O Dependent Code 

As illustrated in Fig. 4 for I/O and memory modules,
each module contains card specific ROM called I/O depen-
dent code, or IODC, which is accessible through standard-
ized I/O registers. The content of the IODC is sufficient to
identify the proper diagnostic and I/O driver software for
the module. This is provided to support autoconfigurable
operating system software. Operator intervention is not re-
quired to configure a new physical card.

System initialization, or boot, involves the execution of
firmware code to initiate an I/O operation on one of the
boot devices, such as a disc. To minimize updates of pro-
cessor ROMs, this firmware is split between the processor
and the I/O modules. The portion of the code shared by all
I/O modules is located on the processor module. The primi-
tive I/O drivers are provided by the I/O modules, and are
called to initialize, test, and read data from the selected
boot device. A stable HP Precision instruction set simplifies
the support of IODC on I/O modules; new ROMs are not re-
quired for each upgrade of the processor hardware.

In addition to assisting system initialization, the IODC
ROM is used to distribute module self-test code, and can
be used to insulate standard I/O driver software from the
implementation dependent features of module identifica- 
tion, configuration, and error recovery. 



Address Space Allocation 

HP Precision I/O Architecture uses a single 32-bit phys-
ical address space. When a physical module is accessed
through a virtual address, the translation to a physical ad- 
dress is performed by the processor, and a physical address 
is used in the bus transaction. The physical address space 
is partitioned into two distinct spaces, the I/O address space 
and the memory address space, as shown in Fig. 5. 

Address space is dynamically assigned. I/O addresses
are assigned from the high end of the physical address 
space and memory addresses are assigned from the low 
end of the physical address space. This generates a compact
address space assignment that minimizes the page table 
resources required to map virtual memory accesses. 

Initially, only the broadcast address space is defined. A
broadcast write transaction is used by a processor to in-
itialize the 256K bytes of address space for its bus. Addi-
tional address space is assigned to other buses and ex-
tended module address spaces as required. The extended
address space for I/O modules and memory modules is 
allocated from the available I/O address space and memory 
address space, respectively. 

The words in the I/O address space correspond to I/O 
registers. Software references to these registers are pro- 
cessed differently from memory transactions; the load or 
store instruction triggers a bus transaction rather than a 
data cache access. The fixed partitioning of I/O and memory
addresses simplifies the processor hardware required to
identify the I/O register accesses, which bypass the data
cache. 
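
With a fixed partition, the test for an I/O register access reduces to a comparison against one boundary. The sketch below assumes the I/O address space is the top 256M bytes of the 32-bit physical address space, which follows from the 256M-byte total given in the configuration table and the statement that I/O addresses are assigned from the high end; the names are hypothetical.

```python
PHYSICAL_SPACE = 2 ** 32          # single 32-bit physical address space
IO_SPACE_SIZE = 256 * 2 ** 20     # total I/O address space (256M bytes)
IO_BASE = PHYSICAL_SPACE - IO_SPACE_SIZE   # assumed boundary: 0xF0000000


def bypasses_data_cache(physical_address):
    """Return True if the address falls in the I/O address space.

    Such a load or store triggers a bus transaction instead of a
    data cache access.
    """
    return physical_address >= IO_BASE


io_access = bypasses_data_cache(0xFFFC0010)
memory_access = bypasses_data_cache(0x00001000)
```

Under this assumption the check needs only the four high-order address bits, which suggests why the fixed partition keeps the processor hardware simple.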

The dynamic allocation of the address space allows the 
address space to be assigned to additional buses or I/O 
elements as required to support the selected hardware con-
figuration. Although the total physical address space is
limited, the number and size of modules that can be sup-
ported are quite large, as shown in the following table.

HP Precision Architecture Configuration Limitations
(Approximate)

I/O Address Space
  Total I/O Address Space                 256M bytes
  System Buses (256K bytes each)          1024
  Processor Modules (4K bytes each)       64K
  Packed I/O Elements (128 bytes each)    2M

Memory Address Space
  Total RAM Configured                    3.75G bytes
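
The limits in the table follow directly from dividing the I/O address space by the size of each item. The short calculation below reproduces them; each count is an upper bound that assumes the whole I/O space is devoted to that one kind of item, and the variable names are hypothetical.

```python
K, M, G = 2 ** 10, 2 ** 20, 2 ** 30

physical_space = 4 * G      # 32-bit physical address space
io_space = 256 * M          # total I/O address space

limits = {
    "system_buses": io_space // (256 * K),     # 256K bytes each
    "processor_modules": io_space // (4 * K),  # 4K bytes each
    "packed_io_elements": io_space // 128,     # 128 bytes each
    "ram_bytes": physical_space - io_space,    # memory address space
}
```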



Connect Protocol 

HP Precision I/O Architecture defines a standard soft- 
ware interface to module registers, independent of the 
physical bus standard. To implement this interface, and to 
support transparent forwarding of transactions through bus 
converters, a single connect protocol is defined for all sys- 
tem buses. 

The connect protocol defines the required and optional 
transactions for all system buses. These transactions are 
initiated by a master, and invoke a response from one or
more slaves. For a read transaction, data is transferred from 
the slave to the master. For a write transaction, data is
transferred from the master to the slave. For a broadcast
transaction, data is transferred from the master to all slaves. 

Although the data transfer sizes are different for I/O and
memory transactions, the basic format of the transactions 
is maintained, as shown in Fig. 6. 

During the address phase, the address of the transaction 
is asserted on the bus. The bus address of the master (master
ID) follows, and is sufficient to identify the module initiat-
ing the transaction. The master ID is transmitted while the
slave is decoding its address, and generally has a minimal
impact on system performance. 

The master ID field is required to resolve potential dead- 
lock conflicts when transactions are forwarded through bus 
converters. The field is also used by the smart-cache pro- 
tocols to maintain consistent copies of data in the cache 
lines of processors attached to separate buses. 

Only a small set of data transfer sizes is defined. The
basic 4-byte and 16-byte sizes, and the larger optional trans-
fer sizes (32, 64, ...) are all powers of two in size, and are
described later in this article. Support of additional transfer 
size options in the memory address space would have in- 
creased the cost of memory modules, since they support 
all of the options. 

At the conclusion of the transfer, status is transferred 
from the slave (or slaves) to the transaction master. If the
slave detects an error (such as a double-bit memory error),
an error condition is reported to the master, to prevent the 
use of corrupted data. For transactions that are correctly 
specified, but cannot be completed immediately, a busy 
status is returned and the transaction is automatically re- 
tried by the master. The busy status is required to avoid 
deadlocks in bus converters and is also used by the special 
transactions provided to maintain cache consistency in a 
multiprocessor environment. 
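
The status handling described above can be sketched from the master's point of view. The three status values, the BusError exception, and the max_attempts bound are all illustrative assumptions; the text does not say whether retries are bounded or how status is encoded.

```python
from enum import Enum


class Status(Enum):
    OK = "ok"
    BUSY = "busy"
    ERROR = "error"


class BusError(Exception):
    """Raised when a transaction cannot deliver trustworthy data."""


def run_transaction(slave, max_attempts=8):
    """Issue a transaction and retry automatically while the slave is busy.

    slave: a callable returning (status, data) for one attempt.
    max_attempts: hypothetical bound so the sketch always terminates.
    """
    for _ in range(max_attempts):
        status, data = slave()
        if status is Status.BUSY:
            continue              # correctly specified, retry later
        if status is Status.ERROR:
            raise BusError("slave reported an error; data is not used")
        return data
    raise BusError("slave remained busy")


# A slave that is busy twice before completing the transfer.
responses = iter([(Status.BUSY, None), (Status.BUSY, None), (Status.OK, 42)])
result = run_transaction(lambda: next(responses))
```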

Parity or alternative forms of error checking protect the 
transaction and slave addresses, master ID, data, and status 
signals. When control signals cannot be parity protected, 
their values and timing are designed to simplify the detec- 
tion of faults through alternative mechanisms (bus time- 
outs, for example). 

Separate transaction types are provided in the I/O and 
memory address spaces. This allows the data transfer size 
to be optimized for its intended use. The read and write 
transactions in the I/O address space are designed to access
I/O registers, which are words (four bytes in size and align-
ment). Simple cost-sensitive cards may implement only
the least-significant byte of each I/O register. 

Transaction Types 

Based on the requirements of processors and DMA-based 
I/O modules, transactions in the memory address space are
optimized for burst data transfers. The CPUs use burst trans- 
fers to read or write cache lines. The DMA-based I/O mod- 
ules use burst transfers to process buffered data packets 
efficiently. Nibble-mode and static column RAM technol-
ogies have minimized the cost of supporting the high-per-
formance burst-mode transfers on memory modules.

All buses support the smallest (16-byte) memory address 
space transaction. This quad-word transfer typically uses
50% of the peak bus bandwidth. Larger burst transfers (e.g.,
32 and 64 data bytes) are options, and are not defined for 
all system buses. If the transfer size is defined in the bus 
standard, it is supported on all modules responding as 
slaves in the memory address space, and is optionally used 
by the transaction masters (processors, the DMA-based I/O
modules, and bus converters). In general, the low-cost
DMA-based I/O module designs use 16-byte transfers, and
high-performance DMA I/O module designs use the largest
transfer size defined by the bus standard.

Two special types of transactions are defined by the con-
nect protocol: broadcast and semaphore. To simplify their
implementation, the architecture constrains the use of
these nonstandard transactions. Broadcast transactions are
only used to update I/O registers in the supervisor element.
A write to a register offset in the broadcast portion of the
I/O address space is equivalent to a sequential set of writes
to the same register offset in each of the supervisor ele-
ments. Broadcast transactions to I/O elements or to the
memory address space are not defined. These and other
generalized uses were not required; supporting them would 
have needlessly complicated the module designs. 

On many of the industry standard buses, semaphore op-
erations are implemented by the processor, which requires
the definition of an indivisible read and write transaction 
pair. Although this transaction pair has been used success- 
fully in previous designs, it is difficult to forward through 
bus converters, and increases the complexity of high- 
bandwidth pipelined bus standards. 

In the HP Precision connect protocol, the semaphore
operation is implemented by memory module hardware, 
and has minimal impact on the complexity of bus stan- 
dards. The semaphore transaction has a unique command 
code, but is otherwise identical to the quad-word read
transaction defined in the memory address space. Like the 
read, the semaphore is recognized by the memory control- 
ler, and four words of data are returned from RAM. The 
semaphore transaction is distinguished by an important 
side effect: the first word at the quad address in RAM is
cleared as the transaction completes. This is sufficient to 
implement the semaphores defined by the HP Precision 
instruction set. 
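
A small software model may help show why a read that clears is enough to build a lock. The MemoryModule class below is a hypothetical stand-in for the memory controller, and the convention that a nonzero first word means "available" is an assumption made for this sketch; the text specifies only that four words are returned and the first is cleared.

```python
class MemoryModule:
    """Toy model of RAM addressed by word index."""

    def __init__(self, words):
        self.ram = [0] * words

    def quad_read(self, word_index):
        """Return the four words of the quad containing word_index."""
        base = word_index - word_index % 4
        return list(self.ram[base:base + 4])

    def semaphore(self, word_index):
        """Like quad_read, but clear the first word of the quad afterward.

        The read and the clear happen inside the memory module, so no
        other transaction can slip between them.
        """
        base = word_index - word_index % 4
        data = list(self.ram[base:base + 4])
        self.ram[base] = 0
        return data


def try_acquire(memory, word_index):
    """Assumed convention: a nonzero value read back means the lock was free."""
    return memory.semaphore(word_index)[0] != 0


def release(memory, word_index):
    memory.ram[word_index - word_index % 4] = 1


memory = MemoryModule(16)
release(memory, 8)                  # initialize the lock as available
first = try_acquire(memory, 8)      # succeeds and clears the word
second = try_acquire(memory, 8)     # fails until the lock is released
```

Because the bus sees only one ordinary-looking read, nothing has to be held locked across a bus converter.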

Module Interrupts 

In any computer, when a module such as an I/O device
requires special service from a processor, the other tasks 
must be interrupted. The interruption mechanism enables 
the processor to respond quickly to high-priority interrupts 
while queuing and eventually servicing large numbers of
low-priority interrupts, all with minimal performance over-
head on the processor.

HP Precision I/O Architecture defines a very simple in-
terrupt system that requires little special hardware and
allows great flexibility in the processor's response to each 
interrupt. A key aspect of this interrupt system is the assign- 
ment of interrupt control to software. The architecture gives 
software the power to assign arbitrary interrupt priorities 
to all modules, direct each module's interrupts to any pro-
cessor in the system, and selectively process or queue in-
dividual interrupts or priority levels.

When a module needs attention or service from a proces- 
sor, the module communicates its need to the processor's 
external interrupt request register by using the same single-
word, memory mapped write transaction used for all other
intermodule communication. This ensures interrupt re-
quests can be passed from any module in the system to
any processor in the system without requiring specialized 
interrupt hardware. Also, since the connect protocol de-
fines broadcast transactions to be a special case of single-
word write transactions, a module can broadcast its inter-
rupt request simultaneously to all processors in the system.
Like the other transactions defined by the connect protocol,
the interrupts propagate transparently through bus convert-
ers, and can be sent to a processor on any system bus.

Interrupts in HP Precision I/O Architecture differ from 
most other designs, which interlock the low-priority de-
vices while the high-priority tasks are being executed. This
interlock was discovered to be inefficient for uniprocessors
and unreliable for multiprocessors. For uniprocessor con-
figurations, this interlock would require that the interrupt-
ing module retry the write to the processor's interrupt reg-
ister until it is completed successfully. The repeated trans-
actions are an inefficient use of bus bandwidth. For a two-
processor configuration, this interlock generates a potential
hardware deadlock. For example, when two processors are 
executing separate high-priority tasks, and software on each
processor sends a lower-priority interrupt to the other, both 
processors become deadlocked. 

Interrupt Groups Hardware 

HP Precision processor interrupts are based on hardware 
support of 32 interrupt groups. Software assigns one of 
these groups to an I/O element before an I/O operation is 
initiated. The value of the interrupt group is returned to 
the processor when an interrupt occurs. Software can inde- 
pendently disable any one or more of the interrupt groups, 
delaying their processing to a more convenient time. This 
is simpler and more flexible than architectures that set the 
interrupt priority in special-purpose hardware, restricting
the ability of software to modify the order in which interrupts
are processed.

Fig. 7 shows the functionality of the interrupt hardware
that supports the interrupt groups. The interrupt system
hardware consists of one register (the external interrupt
message or EIM register) on each I/O element that generates
interrupts, and two registers (the external interrupt enable
mask or EIEM register and the external interrupt request
or EIR register) on each processor. Before an I/O operation
is initiated, software writes a 32-bit value to the EIM register
of the I/O element. This value includes both the address
of the processor to be interrupted and a five-bit encoding
of the interrupt group assigned to the I/O element. When
an element needs service, the single word in its EIM register
provides both address and data for a single-word write
transaction. The address determines the processor to be
interrupted, and five bits of the data specify the interrupt
group bit to be set in the processor's 32-bit EIR register.
Each bit of the EIR register is continuously ANDed with the
corresponding bit of the EIEM register, and if any bit of the
result is true, the processor is interrupted. Software has
complete control of the EIEM register to specify the interrupt
groups that are recognized. As software services each
interrupt, it clears the associated bit of the EIR register to
prepare for the next interrupt. Software running on one
processor is able to interrupt another processor simply by
writing the appropriate data value to the processor's EIR
register.
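
The register behavior just described can be sketched in a few lines of Python. This is only an illustrative model, not the hardware definition: the class and method names are hypothetical, and numbering interrupt group n as bit n from the least significant end is an assumption made here for simplicity, since the text does not give the bit ordering or the layout of the EIM word.

```python
MASK32 = 0xFFFFFFFF  # both processor registers are 32 bits wide


class Processor:
    """Toy model of one processor's EIR and EIEM registers."""

    def __init__(self):
        self.eir = 0   # external interrupt request register
        self.eiem = 0  # external interrupt enable mask register

    def write_eir(self, group):
        """Model the single-word write that requests an interrupt."""
        if not 0 <= group < 32:
            raise ValueError("interrupt group must be in 0..31")
        self.eir |= 1 << group  # assumed mapping: group n -> bit n

    def pending(self):
        """EIR ANDed bit by bit with EIEM."""
        return self.eir & self.eiem

    def interrupted(self):
        """The processor is interrupted if any enabled request bit is set."""
        return self.pending() != 0

    def service(self, group):
        """Software clears the EIR bit as it services that group."""
        self.eir &= ~(1 << group) & MASK32


class IOElement:
    """Toy I/O element; target and group stand in for the EIM contents."""

    def __init__(self, target, group):
        self.target = target  # processor to be interrupted
        self.group = group    # five-bit interrupt group

    def request_service(self):
        self.target.write_eir(self.group)


cpu = Processor()
disc = IOElement(cpu, group=5)
disc.request_service()
print(cpu.interrupted())  # group 5 is requested but still masked
cpu.eiem |= 1 << 5        # software enables group 5
print(cpu.interrupted())
cpu.service(5)            # software clears the request bit
print(cpu.interrupted())
```

Because a request is simply a bit left set in the EIR register, a masked group is deferred rather than lost, which is the property the text contrasts with interlocked designs.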

Although the architected interrupt system is fast and
flexible, the information provided to software is minimal
(only the interrupt group is known). In a system with many
I/O elements, each of which must interrupt the processor
to signal its completion of an assigned task, many of the
interrupt group bits in the EIR register are shared. Unless
an alternative mechanism is provided, the processor software
would be burdened by the overhead of polling the
I/O registers on I/O elements to resolve the source of interrupts
that map to a shared interrupt group bit. A more
efficient mechanism is the status chain feature, which is
associated with DMA modules and is described below.

DMA Module Capabilities 

Direct memory access, or DMA, is defined as an optional
feature of an I/O element in HP Precision I/O Architecture.
DMA is simply the transfer of data between the I/O element
and system memory without intervention by a processor.
The primary objective of DMA is to minimize the effort
required of the processor to support I/O transfers. A
high-performance DMA model allows the data to be transferred
efficiently to system memory, minimizing the need to provide
operating system specific data processing hardware
or firmware on the I/O card.

A uniform DMA model is defined by the I/O architecture
and supported by the connect protocol. The DMA modules
access system memory using the same bus transactions that
processors use. All DMA elements present the same memory
mapped register interface to software, and software
communication to initiate DMA activity uses the single-word
memory mapped transactions defined for communication
with other I/O registers. A uniform definition of the I/O
registers in the DMA hardware interface simplifies the software
interface, since many of the DMA software utilities
can be shared by all of the DMA-based I/O software drivers.

To simplify the connect protocol and processor cache 
designs, all DMA transfers are performed directly to mem- 
ory, and are not affected by the contents of the instruction 
or data caches. Shared utilities in the I/O driver software 
use the cache flush and purge instructions to maintain 
consistent copies of data in the processor caches and mem- 
ory module RAM. 

The DMA element activity is initiated by writes to two 
of the I/O registers on the DMA element. The first I/O regis- 
ter holds the address of the DMA command data structure 
in memory. The write to the second I/O register triggers 
the fetching of these DMA commands from memory for
execution by the DMA element. The command data structures
in memory are organized as a sequence of DMA requests,
as shown in Fig. 8. Each DMA request is organized
as a sequence of four-word data structures, or quads. The
quads are aligned to an address that is a multiple of their 
size. 

The data structures are based on linked lists of quads
rather than a less flexible set of sequential table entries.
The four words of the quad include a pointer to the next
quad in the chain, a command for the DMA element, and
two arguments for the command. The DMA element executes
the successive commands in the chain of quads,
automatically advancing from one quad to the next without
processor intervention. The quad chains can be of arbitrary
length, and can be dynamically extended as required to
queue additional DMA requests. This is accomplished by
changing the pointer in the final quad from its previously
null value to the address of the first quad in the chain to
be appended.
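
The chaining scheme can be illustrated with a small Python sketch. Everything named here is hypothetical: the command strings "INIT", "DATA", and "STATUS" are placeholders rather than architected command codes, Python object references stand in for memory addresses, and the sketch ignores any synchronization needed when a chain is extended while the DMA element is running.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Quad:
    """One four-word entry: next pointer, command, and two arguments."""
    command: str
    arg1: int = 0
    arg2: int = 0
    next: Optional["Quad"] = None  # None plays the role of the null pointer


def build_chain(quads):
    """Link a non-empty list of quads in order and return the first one."""
    for current, following in zip(quads, quads[1:]):
        current.next = following
    return quads[0]


def last_quad(head):
    """Follow next pointers until the null pointer is reached."""
    quad = head
    while quad.next is not None:
        quad = quad.next
    return quad


def append_chain(head, new_head):
    """Extend a chain by overwriting the final quad's null pointer."""
    last_quad(head).next = new_head


def execute(head):
    """Walk the chain as a DMA element would, returning the commands seen."""
    seen = []
    quad = head
    while quad is not None:
        seen.append(quad.command)
        quad = quad.next
    return seen


first = build_chain([Quad("INIT"), Quad("DATA", 0x1000, 2048), Quad("STATUS")])
second = build_chain([Quad("INIT"), Quad("DATA", 0x8000, 512), Quad("STATUS")])
append_chain(first, second)  # queue the second request behind the first
print(execute(first))
```

Appending touches only one pointer in the existing chain, which is why requests can be queued without rebuilding a table.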

Each quad chain consists of one or more DMA requests.
In general, each request corresponds to a separate call of
the I/O driver software. For shared I/O devices, such as a
file system disc, the I/O driver software is expected to append
multiple requests for sequential processing by the
DMA element. Each request is typically partitioned into
three phases. The quads in the initialize phase provide the
parameters for the following data transfer, such as the
media address and length of the total disc transfer. The
quads in the data transfer phase define the physical memory
addresses involved in the DMA data transfer. The quad in
the status phase is used to inform the processor when the
request completes.

The chaining of quads in the data transfer phase allows
the data to be transferred to noncontiguous ranges of physical
addresses in the memory address space. This is useful
in the virtual memory environment provided by HP Precision
Architecture. An I/O request processed by the I/O
driver software typically involves a transfer to or from a
contiguous range of virtual addresses. Software converts
the virtual address range into a set of noncontiguous physical
addresses, and generates the corresponding chain of
quads for use in the data transfer phase of an operation.
Only data transfers to or from the memory address space
are defined, and DMA input is aligned to a multiple of 64
bytes. Arbitrarily aligned DMA data transfers and transfers
to the I/O address space are not required, and would have
complicated the hardware design of DMA modules.
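
The conversion from a contiguous virtual range to a set of physical pieces can be sketched as follows. This is an illustration under stated assumptions, not the driver's actual code: the function name, the dictionary used as a page table, and the default page size of 2048 bytes are all hypothetical choices, and the sketch does not enforce the 64-byte alignment rule mentioned above.

```python
def data_transfer_pieces(vaddr, length, page_table, page_size=2048):
    """Split a contiguous virtual range into (physical address, byte count)
    pieces, one per physically contiguous run.

    page_table maps virtual page number -> physical page number
    (a hypothetical stand-in for the real address translation).
    """
    pieces = []
    while length > 0:
        vpn, offset = divmod(vaddr, page_size)
        chunk = min(length, page_size - offset)  # stay inside this page
        paddr = page_table[vpn] * page_size + offset
        if pieces and pieces[-1][0] + pieces[-1][1] == paddr:
            # Physically adjacent to the previous piece: extend it.
            pieces[-1] = (pieces[-1][0], pieces[-1][1] + chunk)
        else:
            pieces.append((paddr, chunk))
        vaddr += chunk
        length -= chunk
    return pieces


# Virtual pages 4, 5, 6 live in physical pages 10, 3, 4 (made-up mapping).
table = {4: 10, 5: 3, 6: 4}
for paddr, count in data_transfer_pieces(4 * 2048 + 1024, 4096, table):
    print(hex(paddr), count)
```

Each resulting piece would become one quad in the data transfer phase, with the address and count as its two arguments.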

The status phase of a DMA request returns a summary
of the DMA element's status to an entry stored in memory.
The status information is sufficient for software to complete
the processing of successful DMA requests. Unless errors
occur, this summary is sufficient to allow a DMA element
to continue processing additional requests without software
intervention. The status phase generally consists of
a single quad, called the link status quad. The link status
quad instructs the DMA module to write its own status to
a completion entry in system memory (the entry address
is specified by Arg2). The completion entry is inserted into
a linked list in memory (the value of Arg1 specifies the
address of the list head), and a processor interrupt is optionally
generated. The linked list of completion entries is
called the completion list. The ordering of entries in the
completion list is LIFO (last in, first out) to minimize the
complexity of the hardware implementation. By software
convention, each of the completion lists is assigned to a
unique processor interrupt group.

Hardware updates four words of the completion entry, 
and software conventions define additional words of data 
in the entry, such as the address and arguments for the 
interrupt service routine. Before the DMA request is initiated,
these parameters are saved in the space reserved for
the completion entry. When an interrupt is received, software
decodes the interrupt group to select the completion
list to be processed. The data saved by software in the 
completion entry is used to dispatch quickly to the proper 
interrupt service routine. 
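
The completion list mechanism can be modeled in Python as shown below. This is a sketch only: the class names, the use of a Python callable for the interrupt service routine, and the status strings are hypothetical, and the real entry layout (four hardware-written words plus software-defined words) is not reproduced.

```python
class CompletionEntry:
    def __init__(self, handler, argument):
        # Saved by software before the DMA request is initiated.
        self.handler = handler
        self.argument = argument
        # Filled in when the request completes.
        self.status = None
        self.next = None


class CompletionList:
    def __init__(self):
        self.head = None

    def post(self, entry, status):
        """Model of the link status quad: store status, insert at the head."""
        entry.status = status
        entry.next = self.head
        self.head = entry

    def drain(self):
        """Detach all entries and return them, newest first (LIFO)."""
        entry, self.head = self.head, None
        entries = []
        while entry is not None:
            entries.append(entry)
            entry = entry.next
        return entries


def handle_interrupt(group, lists_by_group):
    """Select the completion list from the interrupt group, then dispatch
    each entry to the service routine saved in it."""
    return [entry.handler(entry.argument, entry.status)
            for entry in lists_by_group[group].drain()]


def report(name, status):
    return f"{name}: {status}"


lists = {3: CompletionList()}  # group 3 is an arbitrary choice
read_entry = CompletionEntry(report, "disc read")
write_entry = CompletionEntry(report, "tape write")
lists[3].post(read_entry, "ok")
lists[3].post(write_entry, "ok")
print(handle_interrupt(3, lists))  # the most recently posted entry comes first
```

Because the handler and its argument travel with the entry, software needs no search or polling to find the right service routine once the list is selected.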

Bus Adapters 

Foreign buses are buses that do not conform to the specification
of the HP Precision connect protocol. They can be
connected to a system bus through a bus adapter. The bus
adapter allows HP to preserve the investment in previously
developed products when migrating to the HP Precision
I/O system. Cards developed for the proprietary HP-CIO
backplane are used in the first HP Precision products. By
allowing the use of proven I/O technologies and systems,
the bus adapter has accelerated the design cycle of the
initial HP Precision Architecture implementations.

The first bus adapter to be developed for the HP Precision
I/O system is the HP-CIO channel adapter (see Fig. 9). This
adapter is fully compatible with all existing HP-CIO I/O
cards, and with HP-CIO cards currently in development.
Although the HP-CIO protocol differs from the HP Precision
connect protocol in many ways, the bus adapter maps all
of the necessary HP-CIO functions into the standard register
interface through which it communicates with the HP Precision
I/O system. In accordance with the HP-CIO protocol,
the channel adapter serves as a central time-shared DMA
controller on the HP-CIO bus. The adapter is the initiator
of all HP-CIO bus transactions, and it is the arbitrator that
manages the allocation of the HP-CIO bus bandwidth. As
a bus adapter, the HP-CIO channel adapter provides data
buffering and address generation as it transfers data between
the I/O modules on the HP-CIO bus and the memory
modules on other buses within the HP Precision I/O system.
The adapter also translates interrupts and error messages
into the protocol used by the HP Precision I/O system. By
handling all normal DMA transfers and the majority of
error conditions in complete autonomy, the HP-CIO channel
adapter can greatly reduce the processor overhead required
to operate the HP-CIO bus. Except in the rare error
case that requires software intervention, the HP-CIO channel
adapter appears to the HP Precision I/O system as a set
of DMA I/O elements that conform to most of the specifications
of HP Precision I/O Architecture.

In the future, the bus adapter module can also be used
to support other foreign buses, such as VME. To support
these cards, bus adapter hardware and I/O driver software
are required to convert between the HP Precision I/O
Architecture and connect protocol and the conventions of the
foreign bus. For example, interrupts and DMA transfer
protocols are usually different, and need to be converted.
Although other foreign buses share many properties, their
features require special considerations in the design of each
bus adapter.

The leverage of foreign I/O card designs is not achieved
without cost. Special bus adapter hardware is required,
autoconfiguration capabilities are reduced, and software
complexity is increased. Autoconfiguration features are
generally not available on foreign buses. This typically
limits the assignment of boot devices to preallocated slots
on the bus, or requires a bus adapter ROM update to support
new boot devices. Access to foreign I/O cards is indirect,
and involves software interactions with shared bus adapter
resources to initiate an I/O operation, implement the DMA
transfer to memory, or convert between interrupt protocols.
This overhead increases the complexity of I/O driver software.

I/O System Summary 

By adhering to the strategies of simplicity and unifor- 
mity, many benefits were realized. 

Simplicity is illustrated by the alignment of addresses
and address ranges, the minimal number of transactions
defined in a single connect protocol, and the implementation
of processor interrupts (the EIR and EIEM registers).
Uniformity is illustrated by the transfer of interrupts between
modules (an existing word write transaction is used),
the definition of standard module I/O registers for identification
and configuration of modules (including processors
and memory), and the use of a single connect protocol for
all bus standards.

The verification of the architecture through actual designs
has shown the benefits of meeting the original objectives.
Scalability is achieved through simplicity, and the
architecture makes things "as simple as possible, but not
simpler." "Not simpler" means that concepts and components
are designed to meet the global objectives, rather
than only the needs of a local design center. Components
in the system are interchangeable, so the cost of developing
them is amortized by their use on many different systems.

The biggest benefit HP's customers will see comes from
the flexibility and the identical support of common compo- 
nents. Identical support for common components provides 
transparent migration to faster components, and more or 
faster buses. This migration can be accomplished with minimal
perturbation of the customer's software and/or working
environment.


HP Precision Architecture Performance Analysis

Performance analysis was crucial to instruction set 
selection, CPU design, MIPS determination, and system
performance measurement. 

by Joseph A. Lukes 



HEWLETT-PACKARD PRECISION ARCHITECTURE 
is a key component in Hewlett-Packard's computer
strategy for systems well into the next decade. This 
article is intended to be a brief overview of the contribu- 
tions of a collection of people from HP's performance evalu- 
ation community in the evolution of this strategy. It de- 
scribes the role of these performance groups in the design 
and measurement of the architecture, and in the CPU design 
and systems measurement techniques that have led to the 
computer systems based on this architecture. Presentation 
of measured performance data will not be done in this 
article, but will be left to later papers in this and other 
journals. 

Selection of the Instruction Set

The creation of HP Precision Architecture combined the
expertise of highly experienced specialists in computer
hardware design, compilers, operating systems, architecture,
and performance analysis. The architecture team had
investigated a number of papers on reduced instruction set
computers, and the general conclusion was that a reduced
instruction set computer was a feasible vehicle with which
to migrate HP from its HP 1000, HP 3000, and HP 9000
Computers to a common architecture. The purpose of this
section is to describe the efforts of the HP Laboratories
performance analysis team in creating the data used to
select the instruction set of the new architecture.

Fig. 1. An example of the type of data gathered to aid HP
Precision Architecture instruction set selection (distribution
of IBM 370 instructions by frequency, by language: COBOL,
Fortran, Pascal).

A team of performance analysts was chartered to extend
the studies described in reference 2. To achieve these
objectives, an Amdahl V6 computer was acquired and an
interpretive instruction tracer program similar to that
described in reference 2 was created. This program operated
under the IBM VM/CMS operating system and gathered
raw data from a variety of benchmark programs run on the
V6 by interpreting and simulating each instruction.
Specifically, we gathered the following data:

■ Instructions executed and sequence in which executed 

■ Virtual address of instructions 

■ Virtual address of each operand 

■ Identification of registers used by the operation 

■ Contents of each operand for certain operations. 

This basic data allowed us to derive valuable statistics, 
among which the following were of greatest value: 

■ Dynamic frequency of instruction occurrences 

■ Address traces for instruction sequences and for data 
referenced by these instructions 

■ Characteristics of operations such as the number of
characters used in a move operation, operand values in
arithmetic operations, distances branched, etc.

■ Frequency of operation pairs. 
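
Two of the derived statistics listed above, dynamic instruction frequency and operation pair frequency, are simple to compute once a trace exists. The following Python sketch shows one way to do so; it is not the tracer program described in the text, the function names are hypothetical, and the short trace of IBM 370 mnemonics is made up purely for illustration.

```python
from collections import Counter


def dynamic_frequency(trace):
    """Fraction of executed instructions accounted for by each opcode."""
    if not trace:
        return {}
    counts = Counter(trace)
    return {op: n / len(trace) for op, n in counts.items()}


def pair_frequency(trace):
    """Count how often each (opcode, next opcode) pair occurs in sequence."""
    return Counter(zip(trace, trace[1:]))


# Made-up execution trace: L = load, ST = store, A = add, BC = branch.
trace = ["L", "L", "A", "ST", "BC", "L", "A", "ST"]

freq = dynamic_frequency(trace)
pairs = pair_frequency(trace)
for op, share in sorted(freq.items(), key=lambda item: -item[1]):
    print(f"{op:3s} {share:.3f}")
print(pairs[("L", "A")], "occurrences of a load followed by an add")
```

Pair counts of this kind are what make it possible to estimate how often a pipeline penalty tied to a specific instruction sequence would be paid.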

Fig. 1 illustrates the type of data gathered. Here the
distribution of classes of instructions for COBOL, Fortran, and
Pascal benchmarks is shown. Fig. 2 shows the distribution
of time spent in these benchmarks per instruction class.
Note that the bulk of the operations are simple loads, stores,
and branches. Other operations occur relatively infrequently
but can take a much larger amount of time. By distribution
in time, floating-point operations for Fortran and Pascal,
and storage-to-storage move operations for COBOL are
important instructions. (Storage-to-storage moves can be
simulated by a sequence of load/store instructions.)

Fig. 2. Distribution of IBM 370 instructions by time (by
language: COBOL, Fortran, Pascal).

Most of the programs we tested exhibited these characteristics.
The bulk of operations, both in frequency of use
and time, are simple and dominated by the load, store, and
branch operations. Since simple operations appear to dominate
the frequency of instructions in a computer, the concept
of cycle-per-instruction architectures has arisen.

Such information is just beginning to appear in computer
science literature. Reference 3 points out that the fairly
complex instruction sets of most computers are really not
as helpful to the compiler writer as might be thought.
Instructions such as the IBM 370 MVC (move character) and
MVCL (move character long) are examples of instructions
that might profitably have been left to a simple set of load
and store operations. These instructions move any number
of contiguous bytes (from one to sixteen million). However,
Fig. 3, derived from our benchmark studies, shows that
storage-to-storage moves really only move a small quantity
of data. Reference 2 found the same central truth. Why
bother with really sophisticated movers of characters like
MVC and MVCL when a simple load/store combination in a
small loop can outperform the more sophisticated move
instructions?

Another interesting observation made from looking at
the instruction mixes of a variety of benchmarks is that the
typical mix does not seem to be much a function of the
type of work that is being done. For example, technical
work (such as large Fortran simulations or CAD/CAM) and
commercial work (such as old master in, new master out
or data base work) show the same characteristics: loads,
stores, and branches make up the bulk of operations.

Fig. 4 shows the instruction mix from an HP 3000 Computer
during a peak period (predominantly data base intensive).
Compare this with Fig. 2 and you will find that, other
than the extensive use of floating-point operations in scientific
work, they are very similar. One myth that did not
seem to be borne out by the benchmark work that we did
is that commercial (COBOL) jobstreams require a large 
amount of decimal arithmetic. We cannot find any evidence 
to back this assertion. 

Another rather nonintuitive result of the early benchmark
measurements is that a specialized integer multiply/divide
coprocessor, unlike the results for floating-point, is
probably not worth the extra expense. Fig. 5 shows the
results of the measurement of a large Fortran program's
use of multiply operations and the associated operand
value distributions. At least one operand in the vast majority
of cases is small (less than 500), making the techniques
mentioned in reference 3 quite feasible with a net improvement
in performance.
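
To see why a small operand matters, consider multiplication by shifts and adds, a commonly used software technique. Whether this is exactly the method reference 3 has in mind is an assumption; the Python sketch below is only meant to show that the work is bounded by the number of bits in the smaller operand, so an operand below 500 (fewer than 512, hence at most nine bits) needs at most nine loop iterations.

```python
def shift_add_multiply(a, b):
    """Multiply two non-negative integers using only shifts and adds.

    Returns (product, iterations). The loop runs once per bit of the
    smaller operand, so small operands finish quickly.
    """
    if a < 0 or b < 0:
        raise ValueError("this sketch handles non-negative operands only")
    small, large = (a, b) if a < b else (b, a)
    product = 0
    iterations = 0
    while small:
        if small & 1:          # low bit set: add the shifted large operand
            product += large
        large <<= 1            # next bit is worth twice as much
        small >>= 1
        iterations += 1
    return product, iterations


product, iterations = shift_add_multiply(12345, 499)
print(product, iterations)
```

When one operand is a compile-time constant, a compiler can go further and emit a fixed sequence of shifts and adds with no loop at all.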

As a result of these studies and a variety of others, the
compiler, architecture, hardware, and operating system
teams established and refined the architecture to a set of
instructions for HP Precision Architecture. The architecture,
since it was based on measurements of a large sample
of workloads, evolved from a simple RISC machine to the
far more sophisticated operation set and computer organization
outlined in reference 1. In summary, the conclusions
drawn by the team selecting the instructions for HP Precision
Architecture were:

■ Simple instructions are most of what is executed in a 
wide variety of work. 

■ There are complex instructions that occur frequently
enough (e.g., floating-point) to justify a special set of
hardware to execute them. In HP Precision Architecture
CPUs these are known as coprocessors or special functional
units.

■ Load/store (move a 32-bit word) architectures make sense 
since they permit high-speed general registers to be used 
effectively as the first level of the storage hierarchy. 

■ Simulate complex but infrequent operations so that the 
underlying instruction set can be as simple as practica- 
ble. 

The next section describes the efforts involved in select- 
ing the parameters associated with the family of central 
processing units based on HP Precision Architecture.

HP Precision Architecture Computers

The previous section outlined how a set of instructions
was chosen, each of which, with few exceptions, executes
in one major cycle of the central processing unit's clock.
For example, a central processing unit (CPU) with a 10-MHz
clock would be a 10-million-instruction-per-second (MIPS)
processor with a cycle-per-instruction architecture. This,
of course, assumes no delays as a result of cache or TLB
(translation lookaside buffer) misses. Complex instruction
set computers (CISC), on the other hand, sacrifice cycles
of the CPU to execute more functionally complex instruction
sets, generally through the aid of microcode.

Before describing how the CPU family associated with
HP Precision Architecture was designed, a few words need
to be said about computer system performance. A very
popular measure of the power of a computer system is
the number of MIPS (millions of instructions per
second) that the system's central processing unit(s) can
execute. This measure is an estimate of the capacity of the
CPU to execute the work asked of it. The higher the MIPS
value, the faster the work is done by the CPU. However,
computer systems are not just CPUs. Indeed, they consist
of peripherals, interconnections, main storage, applications,
operating systems, data communications subsystems,
data base subsystems, compilers, and an entire set
of policies for scheduling and billing that affect the performance
of the computer system. Because of this plethora of
variables in the equation that determines the performance
of a given computer system, people have tended to concentrate
on the relative simplicity of MIPS.

The customer who purchases a computer system is generally
not really interested in the capacity of any one component
of that system, such as the CPU, to do work. The
customer is concerned with the response time with which
work is completed, the throughput in jobs per unit of time,
the number of active terminals connected to the system,
and so on.

The next section will describe how we extended the 
metrics used to evaluate the HP Precision Architecture CPU
family to systems performance. The rest of this section will
concentrate on how we determined the MIPS performance 
of the first members of the HP Precision Architecture CPU 
family. 

In creating a new architecture, the engineers and scien- 
tists charged with its development are faced with a large 
problem: how do you measure the effects of the architecture
on MIPS, system throughput, response time, working
set size, etc., when you do not have a CPU built from the
architecture? To be sure, one can prototype the architecture,
but the cost becomes prohibitive for all but a small
number of prototypes. What follows in this section is a 
description of the hierarchical approach taken to develop 
the HP Precision Architecture processor family with mini- 
mum possible cost and maximum performance. 

Let us first examine the MIPS value. It is relatively easy
to calculate:

MIPS = [(cycles per instruction) x (CPU cycle time)]^{-1}
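
As a quick check of the units in this formula, the small Python function below takes the cycle time in nanoseconds and returns millions of instructions per second. The function name and the choice of nanoseconds as the input unit are conveniences assumed here, not part of the original text.

```python
def mips(cycles_per_instruction, cycle_time_ns):
    """MIPS = 1 / (CPI x cycle time).

    With the cycle time in nanoseconds, one instruction takes
    CPI * cycle_time_ns nanoseconds, so the rate is
    1e9 / (CPI * cycle_time_ns) instructions per second,
    or 1000 / (CPI * cycle_time_ns) million instructions per second.
    """
    if cycles_per_instruction <= 0 or cycle_time_ns <= 0:
        raise ValueError("CPI and cycle time must be positive")
    return 1000.0 / (cycles_per_instruction * cycle_time_ns)


# A 10-MHz clock has a 100-ns period.
print(mips(1, 100))   # one cycle per instruction
print(mips(4, 100))   # four cycles per instruction on the same clock
```

The two calls show both levers in the formula: at a fixed 100-ns clock, going from four cycles per instruction to one raises the rating from 2.5 to 10 MIPS.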

The CPU cycle time is the period of the major clock in
the CPU. This value varies from 100 to 200 nanoseconds
for transistor-transistor logic (TTL), and from 20 to 50 ns
for emitter-coupled logic (ECL) or HP NMOS-III logic. The MIPS
capacity of a CPU can be increased by reducing the CPU
clock time by means of new technologies. Many examples
of this trend are seen in the current offerings of many computer
vendors, including HP.

The other variable in the MIPS equation is the number
of cycles per CPU instruction. The lower this number, the
higher the MIPS value will be. As obvious as this seems,
there is still a raging controversy over the efficacy of the
cycle-per-instruction architectures (i.e., RISC architectures),
since critics of the reduced-complexity approach
claim that numerous things can happen that tend to lower
systems performance when one attempts to reduce the cycles
per instruction (CPI) to one. It is not the purpose of
this paper to argue either side. All of our work at HP so
far, however, has pointed out that the HP Precision Architecture
does not appear to limit the ability to make very
high-MIPS computers from existing technologies through
reductions in the CPI (cycles per instruction) and that such
CPUs are capable of offering systems performance comparable
with CISC machines of the same MIPS rating. Details
of this work will accompany specific product announcements.



Simulator and Prototype

An earlier section described in part how the instruction
set was chosen. At first, a proposed set of instructions was
chosen from the collection of written data on
cycle-per-instruction architectures. Then experiments were run
on instruction mixes derived from other architectures that
could be measured (i.e., the IBM 370 set on the Amdahl V6 and
the HP 3000). Finally, analyses were done to select the
instruction and register sets.

Since analysis alone could not take into account the effects
of various design alternatives of the architecture and
the CPUs designed to implement it, a simulator was written
(see article, page 40). The simulator extended our ability
to evaluate design trade-offs by allowing the user to program
simple kernel programs, either by hand or through
the use of a portable C compiler. These kernels were chosen
for their lack of I/O (no operating system existed) and for
their simplicity of compilation (only primitive compilers
existed). A number of invaluable experiments were run on
the simulator and continue to be run on it even today.

However, a simulator is relatively slow and expensive
to use and the number of experiments involved in choosing
the parameters of the CPU family became too much for the
simulator alone. Another problem with simulators is that
they do not convince the skeptical that a revolutionary new
architecture can actually be implemented in existing
technologies such as CMOS, TTL, ECL, or NMOS. As a
consequence, a prototype HP Precision Architecture CPU
was built. It was named the LESS (low-end Spectrum system)
and, although very simple compared to the products
recently announced, it was a fully functioning HP Precision
Architecture CPU that achieved about 0.8 MIPS.

The LESS prototype gave software developers very early
access to the architecture, and served as a vehicle for
experimentation for the architecture, hardware, compiler,
operating system, and performance teams. An interesting
and useful tool that came out of the LESS prototype was
an analyzer board that could be connected to the HP 64000
Logic Development System (more about this later).

As useful as the simulator and the LESS prototype were,
there were parameters of the CPU designs that these techniques
could not determine without untoward expense and
time. Only very simple jobs could be run through the
simulator and prototype since there were no operating systems
or product-level compilers available. Consequently,
the technique used in selecting the instruction set for HP
Precision Architecture was used again, that is, traces on
the Amdahl V6 and on HP 9000 and HP 3000 Computers.
The instruction mixes for these computer systems were
measured and used to simplify the CPU designs for
minimum cycles per instruction. In addition, address traces
were used to generate families of cache and translation
lookaside buffer (TLB) statistics. These measurements
were then used to calculate the cycles per instruction for
a proposed CPU design.

The cycles per instruction (CPI) value is, in simple terms,
a function of the instruction mix, the parameters of the
cache and TLB, and the CPU design:

CPI = basic instruction time + f1(cache,TLB) + f2(interlocks)

where the basic instruction time is 1 cycle for most HP
Precision Architecture instructions, f1(cache,TLB) is the
contribution to CPI of cache and TLB misses, and f2(interlocks)
is the contribution to CPI of the CPU design. CISC
machines tend to have basic instruction times of 4 to 10
cycles for the average instruction. The cache and TLB penalties
and interlock penalties are not really affected by the
architecture to any great extent.

MIPS Model 

We have developed a relatively simple model of the MIPS
performance of HP Precision Architecture CPU implementations
based upon the mixture of instructions and miss
rates of the different cache and TLB sizes and organizations.
Fig. 6 illustrates the instruction mix and Fig. 7 the cache
and TLB curves for one benchmark measured on the Amdahl
V6. Before fully functioning HP Precision Architecture
systems were available, such curves were used to design
the CPUs.

For example, let the cache and TLB designs be such that
the cost of a cache miss is 20 cycles and that of a TLB miss
is 100 cycles. Assume that we are calculating the MIPS of
a processor whose basic instruction times are 10 cycles for
floating-point operations and 1 cycle for all other operations.
Moreover, the pipeline design dictates a load/use
penalty of 1 cycle per occurrence (here a load/use pair
consists of a load followed by either a load or a store
operation).

Using the workload characteristics of the Fortran program
depicted in Fig. 6, the basic instruction time is
10(0.126) + 1(1 - 0.126) or 2.13 cycles per instruction for
instructions alone.

From Fig. 7, the cache and TLB miss rates, assuming the
memory system designs depicted in Fig. 7, are

Cache miss rate = 3.5% (for an 8K-byte cache)

TLB miss rate = 0.2% (for a 512-entry TLB)

Consequently, the cache contribution to f1(cache, TLB) is
(0.035)(1 + 0.348 + 0.154)(20), where the first factor is the
miss ratio of this particular cache, the second is the number
of instruction and data references per instruction (one for
the instruction itself and a probability of 0.348 of the
instruction's being a load or 0.154 of its being a store), and
the third factor is the penalty of a miss, or 20 cycles. Thus
the contribution of cache misses to the number of cycles
per instruction is 1.05 cycles per instruction.

In like manner, the contribution to the CPI of the TLB
misses is 0.3 cycles per instruction.

Finally, for this very simple model of the components
of MIPS, the contribution of interlocks to the CPI consists
of the above-mentioned load/use interlock, (0.06)(1 cycle
per instruction), plus the penalty paid for no-op (no
operation) instructions (see reference 4), or (0.02)(1 cycle per
instruction), for a total contribution of f2(interlocks) = 0.08.

The value of CPI, the cycles per instruction for this
machine design and this benchmark, is CPI =
2.13 + 1.05 + 0.30 + 0.08 = 3.56 cycles per instruction. Note
that this workload has a very high floating-point content,
so the CPI value is larger than for most workloads.

The MIPS value for a 100-ns clock implementation of
this design and benchmark would be 1/(100 ns per cycle x
3.56 cycles per instruction) = 2.81 million instructions
per second.

If one could replace the floating-point operations with
integer operations, and if each integer operation took 1
cycle per instruction, the MIPS value of this design would
be 4.10 million instructions per second, since the CPI in
this case would be 2.44 cycles per instruction instead of
3.56.
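
The worked example above can be checked with a short script. The following Python sketch is an illustration only, not a tool from the original program: the function and parameter names are hypothetical, and it assumes that the TLB sees the same number of references per instruction as the cache, which is consistent with the 0.3 cycle figure quoted above.

```python
def cpi_breakdown(fp_fraction=0.126, fp_cycles=10, other_cycles=1,
                  load_fraction=0.348, store_fraction=0.154,
                  cache_miss_rate=0.035, cache_miss_penalty=20,
                  tlb_miss_rate=0.002, tlb_miss_penalty=100,
                  load_use_cpi=0.06, noop_cpi=0.02):
    """Return the CPI components of the simple MIPS model as a dict."""
    # Basic instruction time: weighted mix of floating-point and other operations.
    basic = fp_cycles * fp_fraction + other_cycles * (1 - fp_fraction)
    # One instruction fetch plus the expected number of loads and stores.
    refs_per_instruction = 1 + load_fraction + store_fraction
    cache = cache_miss_rate * refs_per_instruction * cache_miss_penalty
    # Assumption: TLB misses are charged per reference, like cache misses.
    tlb = tlb_miss_rate * refs_per_instruction * tlb_miss_penalty
    interlocks = load_use_cpi + noop_cpi
    return {"basic": basic, "cache": cache, "tlb": tlb, "interlocks": interlocks}


def mips(cpi, cycle_time_ns=100.0):
    """Millions of instructions per second for a given CPI and clock period."""
    return 1e3 / (cycle_time_ns * cpi)


if __name__ == "__main__":
    parts = cpi_breakdown()
    total = sum(parts.values())
    print(parts, round(total, 2), round(mips(total), 2))
```

Calling `cpi_breakdown(fp_cycles=1)` gives the all-integer variant discussed above; because the sketch carries unrounded intermediate values, its results agree with the figures in the text only to within rounding.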

The ideal RISC-design processor would have a MIPS rating
of 10 or more, assuming one cycle or less per instruction.

Measurements on Actual Processors 

Today we base our instruction mixes, cache and TLB
curves, and other CPU parameters on measurements made
on the HP 3000 Series 930 and HP 9000 Model 840 processors.
Figs. 8 and 9 are examples of these measurements.
Future HP Precision Architecture machines are being designed
with this data. The logic analyzer board mentioned
above and the HP 64000 Logic Development System have
been invaluable tools in debugging, analyzing, and tuning
the new HP 3000 and HP 9000 processors. Fig. 10 shows
a curve of MIPS versus time for the HP 9000 Model 840
derived using the logic analyzer board. Figs. 8 and 9 were
also measured by the analyzer board.

In practice, the MIPS performance of a CPU varies with
time and workload. HP has used heavily loaded values, as
measured with the analyzer, for the specifications for the
new HP 3000 and HP 9000 CPUs. However, a computer
system consists of far more than the central processing
unit. The next section will describe how the entire computer
system is tracked through its design with extensions
of the techniques used in designing the architecture and
processors.

Systems Performance

The previous section emphasized the estimation and
optimization of the CPU capacity as measured in millions of
instructions executed per second. As pointed out in that
section, an increase in MIPS does not necessarily guarantee
a similar increase in system performance measures, such
as an increase in system throughput or a decrease in user
response time. Examples of reasons why system performance
may not increase proportionally to MIPS are serialization
on software queues created to guarantee data consistency
(locking or latching), or an input/output subsystem
not designed to support the increased number of users that
an increase in the capacity of the central processing unit
would dictate.

Fig. 9. Example of a cache miss rate measurement for the
HP 9000 Model 840, an HP Precision processor.
(a) Instructions. (b) Data.

It has been pointed out that something in a computer
system has to be a bottleneck since a perfectly balanced
computer system is probably impossible to achieve (see
reference 5). It would seem to be common sense that the best
choice of the component of the computer system selected to
be the bottleneck would be the most expensive component. To
choose any other would be to waste the most expensive
component and would not maximize performance and
minimize cost. Usually, the most expensive component of
the computer system is the CPU, hence the choice of MIPS
as the metric of computer system performance. But it
should be realized that it takes a lot of hard work to make
the CPU the bottleneck in all but the simplest systems.
This situation presumes a highly tuned operating system,
data communication system, data base management system,
and set of applications. Therefore, although it can be
understood why MIPS is so often used as the performance
metric of choice, this measure must be tempered by other
measures.

An effort was begun in HP Laboratories in 1982 to characterize
the environment in which HP computer systems were
operating. This study concentrated on the characterization
of HP 3000 and HP 1000 installations, since the HP 9000
was too new at the time to characterize clearly. Fig. 11 is
a synopsis of the type of data gathered by measurements
of actual HP customer installations (in this case an HP 3000
installation). From this data, HP engineers have created a
set of workloads used to characterize the high end of the
HP 3000 and HP 1000 environments. From these model
workloads, benchmarks have been created to study system
performance by measurements of the new software components
of the computer systems based on HP Precision Architecture.
Fig. 12 is a high-level view of this process.

Fig. 10. An HP 9000 Model 840 MIPS performance measurement.

It must be mentioned that by "workloads" is meant written
descriptions of snapshots of actual installations. Fig.
11 is such a snapshot. This form of description of a computer
system is useful for analytic and simulative modeling
of possible future alternatives based on this computing
environment. Reference 6 further describes the process of
gathering and using workloads in systems performance
studies. Reference 7 is excellent in its depiction of system
models.

Estimation Using Workload Data 

A very simple example of how systems throughput can
be estimated using data from workloads is as follows.

Let us predict the throughput of a proposed computer
system with a new COBOL compiler and a data base management
and file system redesigned for increased system
throughput. Assume that the current DBMS (data base
management system) has an emerging bottleneck for systems
with increased CPU power because of serialization on its
buffer pool, and that the new design alleviates this
bottleneck, but costs more in instructions. Assume that the
computer system that this system is to replace has the
workload characteristics shown in Fig. 11 and that the
central processing unit of the current system has one fourth
the processing ability (MIPS) of the new system's CPU.
Estimates show that the new DBMS and file system must
execute approximately twice the number of instructions
(dynamic path length) to achieve the same transaction rate
as the current DBMS and file system. Also, the new COBOL
compiler is assumed to have a 10% reduction in dynamic
path length. What is the throughput of the new system
relative to the current system?

Site: Sample

Period of Observation    3600 seconds
Total CPU                1971 seconds
Transactions             3105

Percent of Dynamic Path Length
  Data Base:     10.43      OS Kernel:   26.48
  File System:   12.82      Misc:         6.11
  IO:            33.76      User:        10.4

I/O Information
  Total IO:      134940     Data Base IO:       39611
  Disc IO:        91238     Non Data Base IO:   51627
  (unlabeled):    43702

Fig. 11. Data gathered from an actual HP 3000 installation.

The following simple model is indicative of the techniques
HP design engineers are using to evaluate the kinds
of options depicted above. Using Fig. 11, let us define several
terms (using the techniques outlined in reference 6):

P = time of observation of the system under measurement
C = seconds the CPU is active during the measurement
    period P
T = number of completed transactions observed during
    period P
λ = transactions per second = T/P
μ = seconds of CPU time per transaction = C/T.

System Performance Cycle

  Create/Update Measurement Tools
  Obtain Customer Data for Current Systems
  Create or Update Market Cell Workloads
  Create or Update Market Cell Benchmarks
  Establish Baseline from Current Systems
  Establish Subsystem Performance Objectives
    for New Subsystems and Products
  Model (with Objectives) New Systems
  Project System Performance
    Based upon Objectives and Measurements
  Measure New System
  Track Actual Performance vs. Objectives
    Objectives Not Met:
      Redesign and Recode
    Objectives Met:
      Update Benchmarks for New Products
      Run Benchmarks on New System
      Update Capacity Planning Support Tools
      Ship Products (with Performance Specifications)
      Provide Customer Support (Tuning and Capacity Planning)

Fig. 12. The process used to determine system performance.

If we assume that we have a uniprocessor CPU system,
then ρ = μλ = (C/T)(T/P) = C/P = CPU utilization.

If we assume that the new system is operated at the same
CPU utilization as the old one (to keep response times
roughly the same, for example) then

λ(new)/λ(old) = μ(old)/μ(new) = C(old)/C(new).

That is, the ratio of transaction throughput of the new to
the old system is equal to the ratio of the old CPU seconds
to execute T transactions to the new CPU seconds to execute
T transactions, or the throughput ratio of the new system
to the old (for approximately the same response time) is
inversely related to the ratio of CPU times to execute the
same number of transactions. The value of the original CPU
seconds to execute the observed T transactions in P seconds
is made up of the components shown in Fig. 11.
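
The definitions above reduce to a few divisions. The Python sketch below is illustrative only; the function names are hypothetical, and the transaction count of 3105 is the value read from Fig. 11, which is hard to make out in the printed figure.

```python
def operational_quantities(period_s, cpu_busy_s, transactions):
    """Return (throughput, CPU seconds per transaction, CPU utilization)."""
    throughput = transactions / period_s      # transactions per second, T/P
    cpu_per_txn = cpu_busy_s / transactions   # CPU seconds per transaction, C/T
    utilization = throughput * cpu_per_txn    # equals C/P for a uniprocessor
    return throughput, cpu_per_txn, utilization


def throughput_ratio(old_cpu_s, new_cpu_s):
    """New-to-old throughput ratio at equal CPU utilization.

    Both arguments are the CPU seconds needed to execute the same
    number of transactions on the old and new systems.
    """
    return old_cpu_s / new_cpu_s


if __name__ == "__main__":
    # Values from Fig. 11: 3600-second period, 1971 CPU seconds, 3105 transactions.
    print(operational_quantities(3600, 1971, 3105))
```

With the Fig. 11 values the utilization comes out to 1971/3600, a little under 55%, regardless of the transaction count, since T cancels in the product.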

For example, the data base component from Fig. 11 is
205.6 seconds, or 10.43% of the 1971 seconds that the CPU
is active during the observation period P. In like manner,
the file system component in the current system of Fig. 11
is 252.7 seconds. The 1971 seconds of active CPU time
during the sample period of 3600 seconds recorded in Fig.
11 is therefore made up of six component parts: the DBMS
subsystem, the file system, the low-level I/O subsystem,
the operating system kernel, the user application code
(which we know is written in COBOL), and the effect of
direct terminal connection, which, although not shown in
Fig. 11, is represented in the I/O counts shown in Fig. 11
as "non data base."

If the system software making up the current system were
perfect and allowed increasing levels of multiprogramming
and multiprocessing without penalty, then the new computer
system under consideration would have four times
the throughput for a comparable response time as the old,
since the only limiting factor in this "perfect" computer
system is the power (MIPS) of the system CPU. However,
an increase in CPU power of a factor of four from the current
system would, for this example, allow an increase in data
base throughput of only 10% because of the serious
serialization mentioned above. The factor of two increases in
dynamic path length for the new DBMS and file system
seem to be a high price to pay for increased throughput,
however. Let us use the data in Fig. 11 to test the sensitivity
to this supposition.

For the new CPU and software, the 1971 seconds is reduced
to 1971 seconds / 4 = 492.8 seconds because of the
increased processing power of the new CPU. However, the
COBOL compiler costs less in computer time and the DBMS
and file system cost more. So, the actual figure is 492.8 +
(205.6 + 252.7)/4 - (0.1)(205)/4 = 612.5 seconds.

Consequently, the new system has a throughput relative
to the old of 1971 seconds/612.5 seconds or 3.22:1 instead
of the expected 4.0:1. However, the current design of the
DBMS and file system would evolve into an increase of
only 10% for an increase of 400% in CPU power, whereas
the new system design allows 3.22/4.0 or 81% of the raw
CPU power to be realized. This particular example points
out some not so obvious factors in computer system designs:

■ Transaction throughput may not track linearly with the
power of the system CPU.

■ One may have to execute more instructions in some components
of the software system to realize the power of
an enhanced processor's capacity.

■ Don't spend disproportionate time in tuning a component
that doesn't have much effect on system performance
(such as the COBOL compiler in the above example).
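
The sensitivity test described above can be sketched as a small function. This Python fragment is an illustration under stated assumptions, not HP's modeling tool: the names are hypothetical, the component shares are the Fig. 11 percentages, and the I/O share of 33.76% is inferred so that the six shares sum to 100%. Evaluated this way, the new CPU time comes to about 602 seconds and the ratio to about 3.27, close to but not identical with the 612.5 seconds and 3.22:1 quoted above, so the sketch is best read as a way of exploring the assumptions rather than as a reproduction of those figures.

```python
def new_system_cpu_seconds(total_cpu_s, shares, speedup, path_length_factor):
    """CPU seconds the new system needs for the same transactions.

    shares: component name -> fraction of the old system's CPU time.
    speedup: ratio of new CPU power (MIPS) to old.
    path_length_factor: component name -> new/old dynamic path length
        (components not listed are assumed unchanged, factor 1.0).
    """
    return sum(total_cpu_s * share * path_length_factor.get(name, 1.0) / speedup
               for name, share in shares.items())


# Fig. 11 percentages expressed as fractions (I/O share inferred, see above).
FIG11_SHARES = {"data_base": 0.1043, "file_system": 0.1282, "io": 0.3376,
                "os_kernel": 0.2648, "misc": 0.0611, "user": 0.104}

if __name__ == "__main__":
    factors = {"data_base": 2.0, "file_system": 2.0, "user": 0.9}
    new_cpu = new_system_cpu_seconds(1971, FIG11_SHARES, 4.0, factors)
    print(round(new_cpu, 1), round(1971 / new_cpu, 2))
```

Setting the `user` factor back to 1.0 changes the ratio only slightly, which is the point of the third bullet above: the COBOL compiler has little leverage on system throughput in this workload.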

Estimation Using Benchmarks 

The benchmark, as distinguished from the workload, is
a set of programs, data, and user interactions with the system
that simulates an actual computing environment. The
simpler benchmarks, such as Whetstones or Linpacks (see
reference 8), attempt with one batch program to depict a
diverse multiuser environment with possibly hundreds of
active users. We have felt that such benchmarks are
unrealistic and have extended benchmarking to include
realistic benchmarks that model actual computer installations.
These benchmarks are driven by test setups that use terminal
simulators and a system executing interactive and batch
benchmarks.

Two benefits have been derived from the HP Laboratories
study of customers' use of HP computer systems. One is a
data base of customer measurements like those shown in
Fig. 11 upon which we can base workloads and benchmarks,
and the other is the formalization of the tools gathered to
create this data base into tools and services that are being
sold today, such as HP CapPlan, HP Snapshot, and HP
Trend. The invaluable information gathered by this effort
has allowed us to profile a large portion of HP's customer
set. Our workloads and benchmarks, as a consequence, are
much more complex than the industry standards such as
Whetstones and Linpacks. To measure system performance,
HP development engineers have had to integrate
performance measurement tools into the software and
hardware of the HP Precision Architecture family. The
analyzer board and the HP 64000 are examples of such
instrumentation. Further examples include software
instrumentation that parallels but extends that familiar to
the users of HP 3000 systems and new instrumentation for
the HP 1000 and HP 9000 user.
The HP Precision Simulator

Designed for flexibility, portability, speed, and accuracy, 
the simulator is useful for both hardware and software 
development. 

by Daniel J. Magenheimer 



THE HP PRECISION SIMULATOR is an internal tool
used heavily throughout the research, design, and
development stages of HP Precision Architecture and
its systems software. In addition to functionally simulating
the architecture at the instruction set level, the simulator
provides an interactive screen-oriented machine state display
and user interface, a combination that makes it particularly
useful in compiler and operating system development.
The simulator can also be used as a performance evaluation
tool, since it can easily model different hardware
implementations and record statistical results for comparison.
Finally, the ability to set code and data breakpoints, change
registers and memory locations, and record branch histories
makes it an effective assembly-level debugger.



Development

The goals in the development of the simulator were four:
flexibility, portability, speed, and accuracy. Flexibility was
truly a requirement: in the early research phases of the
architecture, instructions were added and deleted nearly
on a weekly basis and bit fields were shuffled frequently.
Timely results were necessary to evaluate the new instruction
sets, so changes to the simulator were frequent. Portability
was also needed, because simulations were done by
design engineers in their daily work environment as well
as batched on powerful mainframes. To this end, all coding
was done in the high-level C language and use of library
routines was limited to those present in the portable C
library. This design choice allowed ports to several HP
machines, two Digital Equipment Corporation machines,
and an Amdahl mainframe, and even allowed the simulator
to simulate itself.

Speed and accuracy were the most important requirements.
Simulation speeds of thousands of instructions per
second were necessary to provide timely feedback for
performance measurement, but since the simulator was the
only implementation of the instruction set before hardware
prototypes became available, complete and accurate simulation
of each instruction was mandatory. Of course, these
goals often conflicted. For example, top speed can be obtained
by coding in assembly language, but then portability
is lost. Nonetheless, a reasonable simulator was created
which has all of the desired characteristics and satisfies
the needs of a large class of design and development
engineers.


User Interface 

The simulator presents a screen containing four nonoverlapping
windows (see Fig. 1). One of these windows is
used for user command entry and message reporting. The
three other windows contain useful machine state information
as follows:

■ The register window shows the contents of either of two
sets of registers (the general registers or the space and
system control registers), all in eight-digit hexadecimal
format.

■ The program window provides a nine-instruction view
into the program space. Each line of the window contains
the address of a word, both in hex and symbolically, the
word itself (in hex), and a symbolic disassembly of the
instruction. The first two columns indicate whether a
breakpoint is set at that instruction and where the program
counter is currently positioned.

■ Sixteen words are presented in hexadecimal format in
the data window. Each line is preceded by an address,
possibly symbolic.

When the four-window format is too restrictive, for example
when displaying lists or tables of information, the screen
is cleared and information is presented a screen at a time,
prompting the user to hit a carriage return to see the next
screenful.

Fig. 1. The HP Precision simulator screen has four windows
showing commands and messages, registers, program lines,
and data. In the screen shown, the general registers all read
00000000, the data window shows space 00000001, and the
program window reads:

> 00000800 = $START$       / 23600800   LDIL     0x40000000,27
  00000804 = $START$+0004  / 377b0008   LDO      4(27),27
  00000808 = $START$+0008  / 34020002   LDO      1(0),2
  0000080c = $START$+000c  / 00027820   MTSP     2,5
  00000810 = $START$+0010  / 23c00800   LDIL     0x40000000,30
  00000814 = $START$+0014  / 37de0428   LDO      532(30),30
  00000818 = $START$+0018  / b7de000e   ADDI     7,30,30
  0000081c = $START$+001c  / d7c01c1d   DEPI     0,31,3,30
  00000820 = $START$+0020  / 0fc012a8   STWS,MA  0,4(0,30)

The screen and windows can be manipulated for fine
adjustments or major changes of context with simple commands
given in the simulator command language. The command
language is simple enough that a beginner can pick
up the essentials quickly, but extensible so that the advanced
user can accomplish tasks with a minimum of keystrokes.
Each command is given with a set of arguments
(sometimes optional). Frequent commands can be abbreviated,
and a command can be repeated easily. For example,

* sr

displays the space and system control registers in the register
window, while

* pj @(%main + 0.4)

repositions (jumps) the program window to the first word
(four bytes) following the symbol main.

A help facility is also provided for quick reference (see
Fig. 2).

=           assign value to register/data address
ar          display assist registers
bs          set breakpoint
bD          delete all breakpoints
bd          delete breakpoint n
blist       list breakpoints
contin      continue program
(illegible) get indirect files from given directory
dj          jump to specified address in data window
db          move data window backward n words
df          move data window forward n words
dbD         delete all data breakpoints
dbd         delete data breakpoint n
dblist      list data breakpoints
dbs         set data breakpoint
disasm      dump memory disassembled as instructions
do          execute indirect command file
dascii      enter/exit ascii mode in data window
dstack      enter/exit stack mode in data window
gr          display general registers
grclr       clear all general registers
jnl         open journal file
load        load executable file from disc
macro       define macro
macdel      delete (pop) macro definition
maclist     list macros
memdump     dump memory in hex format to file
pj          jump to specified address in program window
pb          move program window backward n words
pf          move program window forward n words
page        create/change access protection for page
quit        quit (?!)
run         run program
redraw      redraw screen
reglist     list registers to file
step        execute single (or n) instruction(s)
save        save simulator status in file
space       create/change bounds/protect of specified space
sr          display special registers
stack       display stack trace
stats       print statistics for most recent run
stop        stop execution of program
symbol      define symbol
symdel      delete (pop) symbol definition
symlist     list symbols
tblist      list last few taken branches
trace       generate address trace to file
update      update screen
\           pass command string to system
#           convert hex to decimal
<           take application input from file
>           write application output to file
.goto       goto label
.if         conditionally execute rest of line
:           target of goto (and else in .if-else)
;           ignore line (comment)

Fig. 2. The simulator provides a help facility for quick reference.

Program Simulation 

As the name implies, the primary function of the 
simulator is to simulate HP Precision Architecture instruc- 
tions and programs built from these instructions. To ac- 
complish this function, several commands and features 
provide the ability to load and execute programs. These 
capabilities are complemented by a full statistics gathering 
mechanism. 

Although small programs can be entered entirely by
hand, it is much more efficient to be able to load a binary
program from the underlying file system. The binary object
file contains sufficient information to allow the simulator
not only to load the program and its data, but also to build
a symbol table and initialize the screen and program
counter properly.

There is an additional problem: when an operating system
loads a program and prepares it to run, it must map
the virtual addresses of the program to physical memory
locations, determine program protection, and enter this
information in internal data structures. On the simulator,
there is no operating system to perform these tasks. It must
do the mapping and information storage itself by making
assumptions about the program it is loading. These assumptions
can be overridden or completely determined by the
user, but reasonable defaults are selected which are generally
sufficient.

Programs can be run from start to finish without interruption,
stopped at appropriate places and continued, or
single-stepped for debugging or educational purposes. In
any case, the effect of each instruction is completely and
accurately simulated and statistics are gathered. For example,

* run

invalidates entries in the cache and TLB (translation
lookaside buffer), resets statistical counters, and starts
execution of the program, and

* contin

starts the program without any initialization (for example,
after encountering a breakpoint).

* step 100 update

single-steps 100 instructions, updating the screen following
each, while

* step .if !%r1 stop

continues stepping forever until general register 1 is equal
to 0.
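
The two step examples share one pattern: execute an instruction, then run a command that may ask the simulator to stop. The Python sketch below illustrates only that control flow. It is not the simulator's C implementation; `ToyMachine`, `step`, and the one-instruction program are hypothetical stand-ins.

```python
class ToyMachine:
    """Hypothetical stand-in for the simulated CPU state."""

    def __init__(self, program):
        self.program = program      # list of callables, one per instruction
        self.regs = [0] * 32        # general registers
        self.pc = 0                 # index of the next instruction
        self.executed = 0           # statistics: instructions executed

    def step_one(self):
        self.program[self.pc](self)
        # Wrap around so that the toy program never runs off its end.
        self.pc = (self.pc + 1) % len(self.program)
        self.executed += 1


def step(machine, count=None, after_each=None):
    """Execute `count` instructions, or run until told to stop if count is None.

    `after_each` is called with the machine after every instruction;
    returning True plays the role of the `stop` command.
    """
    done = 0
    while count is None or done < count:
        machine.step_one()
        done += 1
        if after_each is not None and after_each(machine):
            break
    return done


def decrement_r1(machine):
    machine.regs[1] -= 1


if __name__ == "__main__":
    m = ToyMachine([decrement_r1])
    m.regs[1] = 5
    # Analogous to "step .if !%r1 stop": keep stepping until r1 reaches 0.
    print(step(m, after_each=lambda mach: mach.regs[1] == 0))
```

In the same spirit, `step(m, count=100, after_each=redraw)` with a hypothetical `redraw` function that returns nothing would correspond to `step 100 update`.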

In addition to virtual memory mapping, the simulator
must also provide other functionality normally associated
with an operating system. All interruptions must be
appropriately handled. For some, such as the TLB miss faults,
reasonable action takes place to correct the problem and
program execution continues. For many others, such as
insufficient privilege traps, a program bug is indicated. The
simulator notifies the user with a descriptive message and
stops the program.

Another example of required functionality beyond direct
instruction simulation is I/O. Even the simplest benchmarks
require reading from a file or writing to the screen.
To support this, the simulator recognizes certain
pseudoinstructions as HP-UX system calls. By recoding the system
library to use these pseudoinstructions, it is possible to
run many large programs (including the simulator itself)
and measure the user component of their performance.

Debugging Features 

To be able to observe changes in machine state as one 
steps through a program is often sufficient to debug a
program, but more sophisticated features are always helpful.
The simulator allows assignment to any register or memory 
location at any point in the program execution. This can 
be used to patch or skip portions of code, change data or 
parameters to procedures, and simulate external events 
(e.g., interrupts). In addition, a large set of internal simu- 
lator variables can be changed to modify the behavior of 
the simulator or remember important values. 

[Fig. 3 screen listing: "Last 10 taken branches", with columns
for delay slot address and branch target address.]

Another useful feature is the ability to set code and data
breakpoints. Code breakpoints are marks within the executable
code of a program that cause execution to be halted
when they are encountered in the normal flow of a running
program. When a breakpoint is hit, the program stops and
control is returned to the user at command level. Data
breakpoints can be viewed as temporary access restrictions
on a region of data. Access of data within the region causes
a running program to halt at the instruction that attempted
the access. The region can vary in size from one byte to an
entire space and can be specified to cause a break either
on writes or on both reads and writes. Commands are provided
to set, delete, and list current code and data breakpoints.
For example,

* bs %print

sets a code breakpoint at the memory location associated
with the symbol print.

* bs %foo 50 .if !%r2 = %u0 %u0+1

sets a code breakpoint such that the program will stop only
on the 50th time that the instruction at the beginning of
the foo procedure is executed and will count how many
times (out of 50) that general register 2 is equal to 0 at that
point, recording the result in a user temporary register.
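
The counted breakpoint with an attached command, and the data breakpoint region test, can be sketched as follows. This Python fragment is a hedged illustration of the behavior described above; the class names, the `state` dictionary, and the method names are hypothetical and do not reflect the simulator's internal data structures.

```python
class CodeBreakpoint:
    """Stop on the Nth arrival at an address, running an action on every arrival."""

    def __init__(self, address, pass_count=1, action=None):
        self.address = address
        self.pass_count = pass_count   # stop when this many hits have occurred
        self.action = action           # optional command run on each hit
        self.hits = 0

    def should_stop(self, state):
        self.hits += 1
        if self.action is not None:
            self.action(state)
        return self.hits >= self.pass_count


class DataBreakpoint:
    """Break on writes, or on both reads and writes, within a region."""

    def __init__(self, start, length, on_reads=False):
        self.start = start
        self.length = length
        self.on_reads = on_reads

    def triggers(self, address, is_write):
        in_region = self.start <= address < self.start + self.length
        return in_region and (is_write or self.on_reads)


def count_zero_r2(state):
    # Mirrors ".if !%r2 = %u0 %u0+1": bump a user temporary when r2 is 0.
    if state["r2"] == 0:
        state["u0"] += 1
```

For example, a breakpoint created as `CodeBreakpoint(0x800, pass_count=50, action=count_zero_r2)` (the address is arbitrary) would let the first 49 arrivals pass, stop on the 50th, and leave the tally in `state["u0"]`.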
Another common bug that can be detected with the help
of the simulator is the wild branch, a branch that has a
false target far outside the program, or worse, a target at a
random place within the program. When this happens, the
simulator can provide a list of up to the last twenty taken
branches to assist in determining what went awry (see Fig.
3). The simulator can also display the current stack trace,
a list showing what procedures called the current procedure,
along with parameters and local variables. Finally, a
command can be given to dump a region of memory to a
file, either in hex or in disassembled instruction format.
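
A taken-branch list of bounded length is naturally a ring buffer. The Python sketch below shows one way such a history could be kept; it assumes nothing about the simulator's actual implementation, and the class and method names are hypothetical.

```python
from collections import deque


class BranchHistory:
    """Remember the most recent taken branches, oldest first."""

    def __init__(self, depth=20):
        # A deque with maxlen silently discards the oldest entry when full.
        self.entries = deque(maxlen=depth)

    def record(self, delay_slot_address, target_address):
        self.entries.append((delay_slot_address, target_address))

    def listing(self):
        """Format each entry as 'delay slot address -> branch target address'."""
        return [f"{slot:08x} -> {target:08x}" for slot, target in self.entries]
```

When a program wanders off to a wild target, the last line of `listing()` identifies the branch that took it there, and the earlier lines show how execution reached that branch.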

Performance Analysis 

Before hardware became available, the simulator was the
only tool capable of analyzing performance on a native
instruction stream. As mentioned above, statistics are collected
during program execution to provide cycle count
and instruction distributions. Often, however, performance
issues go well beyond the instruction set. To this end, the
simulator is equipped with a large set of parameters and
flags which allow a performance engineer to analyze different
cache and TLB sizes. Operational characteristics such
as cache and TLB miss overhead and replacement algorithms
can be analyzed, and various memory delays and
interlocks can be modeled.

Fig. 3. Wild branches can be detected with the help of the
taken branch list, a list of up to twenty of the last taken
branches.

Fig. 4. Simulator estimates compared with actual machine
measurements. (Plot legend: Measured, Simulated.)

Using these parameters, studies were done to estimate
the benefits of cache control hints, critical-word-first cache
algorithms, and two-level TLBs and to compare the performance
of a small instruction cache against an instruction
lookahead buffer. Other studies accurately estimated the
performance of different implementations before they were
built. Fig. 4 compares simulator estimates with actual
machine measurements.

Lastly, the simulator can provide instruction execution
traces to feed into other present and future performance
analysis tools.
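
As an illustration of how an address trace can drive a cache-size study, the following Python sketch measures the miss rate of a direct-mapped cache. It is a deliberately minimal model under stated assumptions: the direct-mapped organization, the 16-byte line size, and the function name are choices made for this example, not parameters taken from the simulator.

```python
def miss_rate(trace, cache_bytes, line_bytes=16):
    """Fraction of references in `trace` that miss a direct-mapped cache.

    trace: list of byte addresses, in reference order.
    cache_bytes, line_bytes: cache capacity and line size in bytes.
    """
    if not trace:
        return 0.0
    lines = cache_bytes // line_bytes
    resident = [None] * lines            # block number held by each cache line
    misses = 0
    for address in trace:
        block = address // line_bytes
        index = block % lines            # direct mapping: one possible line
        if resident[index] != block:
            resident[index] = block      # fetch the block, evicting the old one
            misses += 1
    return misses / len(trace)


if __name__ == "__main__":
    # Two sequential passes over a 4K-byte array, one word at a time.
    trace = list(range(0, 4096, 4)) * 2
    for size in (2048, 8192):
        print(size, miss_rate(trace, size))
```

Sweeping `cache_bytes` over a range of sizes for a fixed trace yields a miss-rate curve of the kind used in the CPU design studies, and the same loop applied to page numbers instead of line-sized blocks gives a rough TLB model.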

Miscellaneous Features 
Many features are provided to assist the advanced 



Remote Debugger 



RDB \s a remote debugger, a tool that runs on an HP 9000 
Model 220 Computer (the host machme) but allows mantpulation 
of programs on a different (remote) machine This capabihty is 
especially imporlani in the early stages of testsng a new machine 
I m piemen tatton and developing and bringing up an operating 
system on it, The tool has been used extensively for these pur- 
poses RDB consists ot three major components a user interface, 
an (ntermachrne interface (including hardware and software) 
and a small software monitor which runs on the remote machine 

The user interface was extracted from the HP Precision
simulator and, except for a few minor differences in the list of
commands, an inexperienced user would be hard-pressed to
tell them apart. Besides leveraging thousands of lines of code,
this approach minimized the learning effort for engineers who
used both tools. As with the simulator, registers and memory
locations can be changed and code breakpoints can be set
(data breakpoints are not supported). The same interruption handling
and HP-UX system call support as in the simulator are
used, thus allowing large programs to be run and measured on
new hardware without operating system support.

The intermachine interface consists of two I/O drivers, one on
the host system and one on the remote system, that communicate
through a GPIO 16-bit parallel card. The host controls the communication
by issuing a small set of commands: READ a variable
amount of data from a physical address, WRITE data to an address,
and lock and release semaphores. The remote processor
is started by writing to a continuation flag at a fixed memory
location, and is stopped by asserting an external interrupt.
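
The host-side command set can be pictured with a small model. The sketch below stands in for the remote machine with a byte array and implements READ, WRITE, and semaphore lock and release, plus starting the remote processor by writing a continuation flag. The class and method names, the flag address, and the flag value are hypothetical; the GPIO parallel link and the external interrupt are not modeled.

```python
class RemoteLink:
    """Toy stand-in for the host side of the intermachine interface."""

    CONTINUATION_FLAG = 0x100   # assumed fixed address, for illustration only

    def __init__(self, memory_size=4096):
        self.memory = bytearray(memory_size)   # models remote physical memory
        self.locked = set()                    # semaphores currently held

    def read(self, address, length):
        """READ a variable amount of data from a physical address."""
        self._check(address, length)
        return bytes(self.memory[address:address + length])

    def write(self, address, data):
        """WRITE data to a physical address."""
        self._check(address, len(data))
        self.memory[address:address + len(data)] = data

    def lock(self, semaphore):
        """Lock a semaphore; returns False if it is already held."""
        if semaphore in self.locked:
            return False
        self.locked.add(semaphore)
        return True

    def release(self, semaphore):
        """Release a semaphore; releasing a free one is a no-op."""
        self.locked.discard(semaphore)

    def start_remote(self):
        """Start the remote processor by writing the continuation flag."""
        self.write(self.CONTINUATION_FLAG, b"\x01")

    def _check(self, address, length):
        if address < 0 or length < 0 or address + length > len(self.memory):
            raise ValueError("access outside remote memory")


link = RemoteLink()
link.write(0x200, b"\xde\xad\xbe\xef")
```

Keeping the command set this small means everything else a debugger needs (registers, breakpoints, state inspection) has to be expressed as memory reads and writes.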

The monitor is a small (less than 1K bytes) operating system
subset that catches interrupts, traps, and faults and notifies the
host processor of the type of interruption. Since the host processor
can only read from and write to memory, not registers, the
monitor is also responsible for saving the machine state in memory
and restoring it on continuation.
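
Because the host sees only memory, register access has to go through a save area. The following sketch models that idea: on an interruption the monitor copies the registers into memory and records the interruption type, the host edits the saved copy, and on continuation the monitor reloads the registers. The save-area layout, the register count, and all names are assumptions made for illustration.

```python
import struct

NUM_REGS = 32            # assumed number of general registers
SAVE_AREA = 0x400        # assumed address of the register save area
TYPE_WORD = 0x3FC        # assumed address of the interruption-type word


class Monitor:
    def __init__(self, memory):
        self.memory = memory              # shared with the host
        self.regs = [0] * NUM_REGS        # not directly visible to the host

    def on_interruption(self, kind):
        """Save the machine state to memory and post the interruption type."""
        struct.pack_into(">%dI" % NUM_REGS, self.memory, SAVE_AREA, *self.regs)
        struct.pack_into(">I", self.memory, TYPE_WORD, kind)

    def on_continuation(self):
        """Restore the machine state, including any host modifications."""
        self.regs = list(struct.unpack_from(">%dI" % NUM_REGS,
                                            self.memory, SAVE_AREA))


memory = bytearray(4096)
monitor = Monitor(memory)
monitor.regs[3] = 0x1234
monitor.on_interruption(kind=7)
# The host changes register 3 by writing to its slot in the save area.
struct.pack_into(">I", memory, SAVE_AREA + 4 * 3, 0xCAFE)
monitor.on_continuation()
```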
= $u29 ; Put stack marker unwind back to Q.
.if $u28=d0d0caca .goto trans
= $u5 ((r12-r13)/2) ; Delta P.
= $u31 $u5 ; P for display.
.goto stat
:trans
= $u5 pc ; translator pc.
:stat
= $u4 r8&ffff ; STATUS.
= $u30 $u4 ; STATUS for display.
dj - sr5.(r3-6) db ; Show stack around Q.
.if $u28=d0d0caca .goto m2
= $cmdfecho$ 1 ; Emulator
do cmcurm.ss
.goto showein
:m2
= $cmdfecho$ 1 ; Translator
do cmcuirtni.ss
:showein
t $u30
.if $u28=d0d0caca .goto done : # $u31
:done

Fig. 5. An example of a simulator command program. (Courtesy
of Tony Hunt.)

simulator user. Indirect command files can be executed to
avoid repetitive command sequences. These files can be
nested, can be commented, and can contain if statements
and goto statements which raise the command language to
the power of any high-level programming language (Fig.
5). Other commands provide for saving and restoring of
the simulator's state (so a session can be resumed at a later
time), recording of command sequences in a journal file,
macro definition and use, and an HP-UX shell escape.
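
To make the role of if and goto in a command file concrete, here is a minimal interpreter sketch. Its syntax (labels starting with a colon, comments after a semicolon, `.if NAME=VALUE` and `.goto`) is a simplified assumption for illustration and is not the simulator's actual command language; nesting of command files is not modeled, and the `set` and `add` commands are invented for the demonstration.

```python
def run_command_file(lines, execute, variables=None, max_steps=10000):
    """Run a command file with labels, comments, .if and .goto.

    Assumed syntax (illustrative only):
      :name                      label
      ; text                     comment (also allowed at end of line)
      .goto name                 unconditional jump
      .if NAME=VALUE command     run command when the variable equals VALUE
    Any other line is passed to execute(line, variables).
    """
    variables = {} if variables is None else variables
    # Strip comments and surrounding space, then record each label's position.
    program = [line.split(";", 1)[0].strip() for line in lines]
    labels = {text[1:].strip(): i
              for i, text in enumerate(program) if text.startswith(":")}
    pc = 0
    steps = 0
    while pc < len(program):
        steps += 1
        if steps > max_steps:
            raise RuntimeError("command file did not terminate")
        text = program[pc]
        pc += 1
        if not text or text.startswith(":"):
            continue
        if text.startswith(".if "):
            condition, _, rest = text[4:].partition(" ")
            name, _, value = condition.partition("=")
            if str(variables.get(name)) != value:
                continue                  # condition false: skip the command
            text = rest.strip()
        if text.startswith(".goto "):
            pc = labels[text[6:].strip()]
            continue
        execute(text, variables)
    return variables


def execute(line, variables):
    """Toy command set: 'set NAME N' and 'add NAME N'."""
    op, name, value = line.split()
    if op == "set":
        variables[name] = int(value)
    elif op == "add":
        variables[name] += int(value)


script = [
    "set n 0        ; loop counter",
    ":again",
    "add n 1",
    ".if n=3 .goto done",
    ".goto again",
    ":done",
]
result = run_command_file(script, execute)
```

A conditional jump plus labels is enough to express loops, which is what lets a command file stand in for a small program.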

Progeny 

Work on the simulator influenced the development of
several other tools. Foremost among these is RDB (see box),
a remote debugger, which is still being used for low-level
operating system and I/O development and booting of newly
developed hardware implementations. The instruction
disassembler component of the simulator has been extracted
and used in HP-UX's assembly language debugger adb and
in the high-level language debugger xdb. The user interface
has been borrowed for a hardware support monitor and for
the MPE XL native and compatibility mode debuggers.
Other work is in progress to extend the simulator to handle
multiprocessing configurations.
Editor:

I read about the Spectrum program in a recent issue of the
HP Journal ("Compilers for the New Generation of Hewlett-Packard
Computers," January 1986). In that article you talked
about the primitive instructions that will be accepted. You
wrote that the instructions that reference only registers are
faster than those that imply access to memory.

You said that among the instructions that take more than
one cycle in their execution are the loads. The order of execution
of an operation, such as addition, might be:

1. Load first operand, a, in register p;

2. Load second operand, b, in register r;

3. Perform the operation between registers p and r;

4. Store the result if necessary.

Why is it not possible to have an instruction that loads both
operands at the same time, in one step? This implies a dual
access to memory in reading only. (The cases that would imply
dual writing to memory are very rare.) With such an instruction,
the order of execution now is:

1. Load, by dual access, the two operands a and b in registers
p and r;

2. Perform the operation between registers p and r;

3. Store the result if necessary.

This way the computer saves a step, since it takes only a
single step to load the operands in the two registers.

0. ing. Dejan Claud 
Maramures, Romania 



As you noted, some instructions are permitted to take longer
than one cycle to complete. In most implementations, however,
LOAD executes in a single cycle. Additional cycles may be
required if the next sequential instruction refers to the register
loaded by the load. This is an "interlock" situation that could
take between one cycle (if the data is already in the cache)
and many cycles (if the data must be fetched from main memory).
The faster case is usually scheduled by the optimizing
compilers in such a way that the register loaded is not referenced
by the immediately following instruction. In this way,
the interlock is avoided, and useful computation is performed
in parallel with completion of the LOAD.
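
The effect of this scheduling can be seen with a toy cycle count. The sketch below assumes that every instruction issues in one cycle and that reading a register in the instruction immediately after the LOAD that fills it costs one extra interlock cycle (the cache-hit case). The instruction encoding and the one-cycle penalty are simplifying assumptions, not measurements of any implementation.

```python
def count_cycles(program, load_use_penalty=1):
    """Count cycles for a straight-line program.

    Each instruction is (opcode, destination, sources). A penalty is
    charged when an instruction reads the register loaded by the
    immediately preceding LOAD.
    """
    cycles = 0
    previous_load_target = None
    for opcode, destination, sources in program:
        cycles += 1
        if previous_load_target in sources:
            cycles += load_use_penalty      # interlock stall
        previous_load_target = destination if opcode == "LOAD" else None
    return cycles


unscheduled = [
    ("LOAD", "p", ("a",)),
    ("LOAD", "r", ("b",)),
    ("ADD", "t", ("p", "r")),      # uses r right after it is loaded
    ("SUB", "u", ("x", "y")),      # independent work
]
scheduled = [
    ("LOAD", "p", ("a",)),
    ("LOAD", "r", ("b",)),
    ("SUB", "u", ("x", "y")),      # moved in behind the second LOAD
    ("ADD", "t", ("p", "r")),
]
```

Under these assumptions the unscheduled sequence takes five cycles and the scheduled one takes four, with the same instructions executed.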

You ask why it is not possible to implement an instruction
that loads two operands to two registers at the same time. In
fact, LOADs are less frequent than your example suggests. Frequently
one or more operands are already present in registers,
and do not require access to memory.

To carry out simultaneously all of the actions required to
implement a "double load," considerable additional hardware
would be required. Providing for dual access to the cache would
require the duplication of essentially the entire cache and virtual
address translation hardware. Memory systems are inherently
serial devices, because all memory elements share common
addressing, checking, and control hardware. It is possible
to design memory systems that are truly dual-ported (not just
a "multiplexed" single port), but the cost in decreased speed or
capacity is considerable.

Of course, to fully support "double load," the address and
data buses, the effective address adder, and the data-addressing
and register-specifying content of the instruction itself
would all have to be duplicated. In short, the architecture
supports the highest-performance LOAD that is commensurate
with a high-speed implementation. Any additional function associated
with LOAD would increase cost more than it increased
performance, and so would be inadvisable.

A key characteristic of HP Precision Architecture is that,
unlike most microcoded machines, the performance of implementations
is limited by the hardware's ability to perform
the requested operation, not by the control unit's ability to
decode instructions, specify registers, and sequence signals.
The best that one can hope for is that all (or much) of the
machine's hardware can be kept busy on each cycle by most
programs. In general, this objective is better served by simpler
hardware configurations than by complex ones.

Michael J. Mahon 

Manager, Computer Languages Laboratory 

Information Technology Group 



Address Correction Requested
Hewlett-Packard Company, 3000 Hanover
Street, Palo Alto, California 94304

HEWLETT-PACKARD JOURNAL

August 1986 Volume 37 • Number 8

Technical Information from the Laboratories of
Hewlett-Packard Company

Hewlett-Packard Company, 3000 Hanover Street
Palo Alto, California 94304 USA
